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Abstract 

Portfolio methods support the combination of different algorithms and heuristics, including stochastic 
local search (SLS) heuristics, and have been identified as a promising approach to solve computation- 
ally hard problems. While successful in experiments, theoretical foundations and analytical results for 
portfolio-based SLS heuristics are less developed. This article aims to improve the understanding of the 
role of portfolios of heuristics in SLS. We emphasize the problem of computing most probable explana- 
tions (MPEs) in Bayesian networks (BNs). Algorithmically, we discuss a portfolio-based SLS algorithm 
for MPE computation, Stochastic Greedy Search (SGS). SGS supports the integration of different initial- 
ization operators (or initialization heuristics) and different search operators (greedy and noisy heuristics), 
thereby enabling new analytical and experimental results. 

Analytically, we introduce a novel Markov chain model tailored to portfolio-based SLS algorithms in- 
cluding SGS, thereby enabling us to analytically form expected hitting time results that explain empirical 
run time results. For a specific BN, we show the benefit of using a homogenous initialization portfolio. 
To further illustrate the portfolio approach, we consider novel additive search heuristics for handling 
determinism in the form of zero entries in conditional probability tables in BNs. Our additive approach 
adds rather than multiplies probabilities when computing the utility of an explanation. We motivate the 
additive measure by studying the dramatic impact of zero entries in conditional probability tables on the 
number of zero-probability explanations, which again complicates the search process. We consider the 
relationship between MAXSAT and MPE, and show that additive utility (or gain) is a generalization, 
to the probabilistic setting, of MAXSAT utility (or gain) used in the celebrated GSAT and WalkSAT 
algorithms and their descendants. Utilizing our Markov chain framework, we show that expected hitting 
time is a rational function - i.e. a ratio of two polynomials - of the probability of applying an additive 
search operator. 

Experimentally, we report on synthetically generated BNs as well as BNs from applications, and 
compare SGS’s performance to that of Hugin, which performs BN inference by compilation to and 
propagation in clique trees. On synthetic networks, SGS speeds up computation by approximately two 
orders of magnitude compared to Hugin. In application networks, our approach is highly competitive in 
Bayesian networks with a high degree of determinism. In addition to showing that stochastic local search 
can be competitive with clique tree clustering, our empirical results provide an improved understanding 
of the circumstances under which portfolio-based SLS outperforms clique tree clustering and vice versa. 
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1 Introduction 


In this article we study the problem of computing a most probable explanation (MPE) in Bayesian networks 
[72] . Informally, an MPE is an instantiation of all non-evidence nodes in a BN such that no other instantiation 
has greater probability. MPE computation is a problem that is common to probabilistic formulations 
of diagnosis, image processing, error correction decoding, and genetic linkage analysis [22,23,49,82,83]. 
Ideally, one would prefer to use exact algorithms for MPE computation - for example algorithms such 
as clique tree clustering [2,45,80], conditioning, [37,71,72], variable elimination [47,86], or branch-and- 
bound [51-53]. However, the MPE problem is computationally hard [1,81], and this hardness manifests 
itself in slow execution times even in relatively simple networks that are used in current applications. The 
complexity of structure-based algorithms depends on the treewidtlr of a BN’s underlying graph or the optimal 
maximal clique size of a BN’s induced clique tree [3,16,18]. Due to their very large treewidtlrs or optimal 
maximal clique sizes, exact algorithms have proven to be infeasible or impractical in many application BNs. 

Because of the limitations of exact algorithms, along with the importance of the MPE problem in appli- 
cations, the development of improved algorithms for MPE computation is of great interest. Stochastic local 
search (SLS) algorithms have proven to be competitive in solving computationally hard problems including 
satisfiability (SAT) [27,35,76,78,79], the most probable explanation [39,41,48,55,61], and the maximum a 
posteriori (MAP) hypothesis [68,69]. Unfortunately, the theoretical understanding of SLS algorithms has 
been lagging [33], and despite recent progress [32,58,63] it is clear that further advances are needed. This 
work is part of a larger research effort where the ultimate goal is the development of highly adaptive but 
well-understood SLS algorithms, including SLS algorithms for MPE computation in BNs. Progress on such 
adaptive SLS algorithms has already been made, for example in the areas of adaptive noise [20,31,54] and 
learning predictive models [36,43,74,85]. In these adaptive SLS algorithms, there is a need to search in the 
space of SLS search parameters in addition to the fundamental SLS search process for an MPE. In other 
words, there is a two-level or two-step process: object-level SLS search, and metal-level search for approxi- 
mately optimal SLS parameters that control the object-level SLS search process. To make the vision of this 
two-level or two-step process a reality, we believe that an improved understanding of the object-level search 
space is essential, hence we set in this paper out to provide such improved understanding in the context of 
portfolio methods. 

Portfolio methods support the combination of a wide range of different algorithms, and have been identi- 
fied as a promising research direction [25,26,38,39,61,85]. This article provides an improved understanding 
of the role of portfolios of heuristics in stochastic local search. Algorithmically, we introduce a portfolio- 
based stochastic local search approach which utilizes an initialization portfolio and a search portfolio. Our 
approach is implemented in the Stochastic Greedy Search (SGS) system; two SGS algorithms called SlM- 
PLeSGS and OperatorSGS are presented in this article. The OperatorSGS algorithm is a portfolio- 
based [25,38,61,85] approach to MPE computation using SLS. It provides a flexible and general framework 
for stochastic explanation initialization and search. Specifically, OperatorSGS allows us to combine dif- 
ferent initialization operators (or initialization heuristics) and different search operators (greedy and noisy 
heuristics). Given our portfolio approach, one does not need to take a winner-takes-all approach to these 
different heuristics or operators. Instead, one can combine (and eventually adaptively tune) them according 
to the problem instance and application-specific requirements at hand. Within this portfolio framework, we 
make progress related to the use of a variety of different initialization and search algorithm operators. We 
introduce a novel augmented random walk model (Definition 54) and show that it induces a Markov chain 
(Theorem 55), thereby enabling us to analytically form expected hitting time results that parallel empirical 
run time results [58]. For a specific BN, we show the benefit of using a homogenous initialization portfolio 
(see Definition 56 and Theorem 58). 

To illustrate the portfolio approach, we consider novel heuristics for handling determinism in BNs. 1 
Our approach, which we here carefully relate to MAXSAT, adds rather than multiplies probabilities when 
computing the utility of an explanation, and we therefore call it additive utility. Quantitatively, we study the 
impact of zero entries in CPTs on the number of zero-probability explanations, and show dramatic increases 
in the probability of a randomly picked BN explanation being zero as a function of the probability of a CPT 
entry being zero, the number of BN nodes, and the BN node state space size (Theorem 13). Gain (i.e. , change 
in utility) functions merely measure the progress made when one BN node’s state is flipped in an explanation. 

1 There are several alternative, very different approaches, to handling determinism. Arithmetic circuits handle determinism 
very well [4,7], as does branch-and-bound that explores an AND/OR search tree [51-53], and one can also carefully encode a 
BN into a weighted MAXSAT problem instance [67,75] and then use a weighted MAXSAT solver [19,29,42,44,46,84], Since 
the main emphasis in this article is on the portfolio approach to SLS, with determinism handling serving as an illustration, we 
leave detailed comparison to these alternative approaches to future work. 
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The traditional gain function, multiplicative gain, computes change in an explanation’s probability resulting 
from a flip. We carefully formalize and analyze multiplicative and additive gain. We consider the relationship 
between MAXSAT and MPE, and show that additive utility (or gain) is a generalization of MAXSAT utility 
(or gain) from the GSAT and WalkSAT algorithms and their descendants [35,76,78,79] to the probabilistic 
setting. Both gain functions are utilized in greedy and noisy search operators, which make up the search 
portfolio, where the noisy operators are used to effectively escape local but non-global optima. Let pa 
be the probability of selecting an operator that uses additive gain. Utilizing the Markov chain framework 
mentioned above, we show that expected hitting time h(j>A) is a rational function P(pa)/Q(pa ) of pa, where 
P(pa) and Q{pa ) are polynomials (Theorem 60). 

Empirically, the performance of SGS is compared to that of the state-of-the-art inference system Hugin, 
which implements a clique (or join) tree algorithm [2,14,45]. The clique tree algorithm can be used to compute 
either marginals [45] or MPEs [14], and is among the most well-known inference methods for Bayesian 
networks. We experiment with Hugin and SGS on both synthetic and application BNs. Comparisons to 
Hugin show that Stochastic Greedy Search performs significantly better for certain randomly generated 
Bayesian networks as well as for partly deterministic Bayesian networks from applications. We utilize an 
experimental paradigm for generating hard and easy synthetic BN instances [57,62]. In a bipartite BN, 
let V be the number of root nodes and let C be the number of leaf nodes. Synthetic bipartite BNs of 
increasing hardness can be generated by increasing C while keeping V constant [55,62]. Our results on 
synthetic BNs generated in this way, and where root nodes have uniform distributions and leaf node are 
or-nodes, are as follows: As the C/V - ratio increases, the measured run times of both Hugin and SGS 
increase at an approximately exponential rate as a function of increasing C/V- ratio. However, given a 
suitable measure of gain, and specifically additive gain, SGS is approximately two orders of magnitude 
faster than Hugin. Beyond synthetic BNs, we also found that our algorithm is quite effective on application 
networks with substantial determinism, and in many cases it performs comparably to or better than Hugin. 
In this article we highlight two reasons for SGS’s success in application BNs, namely the ability to exploit 
different search operators in OperatorSGS as well as the use of the additive measure of gain. (Another 
significant component in the success of SGS is the stochastic initialization portfolio, including its dynamic 
programming and forward simulation algorithms [61,63]. The heuristics in the initialization portfolio suggest 
good starting points to the stochastic local search component of SGS; see also [41,68,69].) 

The rest of this article is organized as follows. Section 2 introduces the problem of computing a most 
probable explanation (MPE) as well as related results, definitions, and notation. Section 3 discusses measures 
of utility and gain which form the basis of all local search algorithms, including SGS, and pays special 
attention to the additive approach. In Section 4 we describe the overall structure of our stochastic local search 
approach, SGS. We present two SGS algorithms, namely SimpleSGS and OperatorSGS, and also discuss 
related research. Section 5 discusses analytical results. In Section 6 we turn to the experimental results of 
the article: Section 6.1 presents results for synthetically generated Bayesian networks, while in Section 6.2 
we discuss experimental results for application BNs. Section 7 discusses related work and compares it to 
our two SGS variants. Section 8 concludes and discusses future work. This article extends and revises our 
earlier reports on SGS [55,61]. 

2 Preliminaries 

This section briefly reviews some of the key definitions and results for Bayesian networks and MPE compu- 
tation as they relate to our research. Let (X, W) be a directed acyclic graph (DAG) with nodes X and 
edges W, and let X* £ X. We use the notation n(Xj) = to indicate the parents of X* in the DAG, and 
’L(Xj) = \E <Xi to indicate the children of X*. A Bayesian network (BN), formally introduced in Definition 
1 below, represents a multi-variate probability distribution as a DAG, where the nodes represent random 
variables. 

Definition 1 (Bayesian network) A Bayesian network is a tuple j3 = (X, W, P), where (X, W) is a 
DAG with an associated set of conditional probability distributions P = {Pr(Xi | n^), . . . , Pr(X n | nx„)}. 
Here, Pr(Xj | 1 1 Xi ) is the conditional probability distribution for X* G X. Further, let n = |X| and let Tr Xi 
represent the instantiation of the parents of X* . The independence assumptions encoded in (X , XV) 
imply the joint probability distribution 

n 

Pr(sc) = Pr(xi, ...,x n ) = Pr(Xi = Xl , . . . , X n = x n ) = JJPr(a: i | n Xi )- (1) 

i— 1 
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A conditional probability distribution Pr(A) | Tlxt ) is also known as a conditional probability table 
(CPT). While BN nodes can be continuous, this article is restricted to the discrete case and we will take 
“BN node” to mean “discrete BN node”. Suppose that a BN node X has states {x\, . . . , x m }. We then use 
the notation = £l(X) = {x \, . . . , x m }. For simplicity, but without loss of generality, we often use binary 
nodes in the following, in which case a BN node X has |flx| = |X| = |{0, 1} | = 2 states. 

A BN may be given evidence by clamping some of its nodes to their observed states. An instantiation 
of the remaining nodes is an explanation, formally defined as follows. 

Definition 2 (Explanation) Consider a BN (3 = (X, W, P) with X = {Xi,..., X n } and evidence 
e = {X 1 = X\, X m = x m } where m < n. An explanation x is defined as x = {x m+ \, . . x n } = 
{ — x m -\- \ , • • • , X n — x n } . 

When discussing an explanation x , the BN (3 is typically left implicit. One is often interested in computing 
Pr(a: | e). However, in order to simplify the exposition, we may consider Pr(t/) = Pr(a:, e), where y = x U e, 
instead of Pr(a; | e). This does not fundamentally change the computation since Pr(y) = Pr(a:, e) = 
Pr(a; | e) Pr(e). 

Given evidence, one can perform various forms of BN inference. This article focuses on computing the 
most probable explanation, which is defined as follows. 

Definition 3 (Most probable explanation (MPE)) Computing a most probable explanation (MPE) in 
a BN with evidence e = {Xi = X\, . . X m = x m } is the problem of finding an explanation x* E f2(X m+ i)x 
• • • x Q(X n ) such that Pr (a;* | e) > Pr ( y \ e), where y E f2(X m +i)x • • • x Tl(X n ) is any other explanation 
in the BN. The set of the k most probable explanations is defined as X* = {x\, . . ., x* k } where Pr (x* | e) = 
Pr(** | e) = ••• = Pr (x* k \ e). 

In other words, given evidence e, no other explanation has higher probability than x* E X*; it is 
sometimes convenient to include e into an MPE and consider y* = x* U e to be an MPE. Note that there 
might be many explanations with the same probability, and for this reason we say “an” MPE rather than 
“the” MPE. 

In this article we emphasize the connection between the SAT and MPE problems and now introduce a 
few relevant definitions. 

Definition 4 (SAT, SAT problem) A satisfiability (SAT) formula <j> = (x, q. c ) is defined byV variables 
x = {aq, . . ., xy}, a set of L literals q = {q\, . . qx}, where qi = x or qi = x for x E x, and C distinct 
clauses c = {ci, . . Cc}, where each clause consists of literals combined using the or (“\/”) connective. The 
satisfiability (SAT) problem is to determine whether there is a truth assignment t: x — > {0, 1}' that makes 
Cj = 1 for all 1 < j < C. Such an assignment is called a satisfying (truth) assignment (or model). 

SAT is a celebrated NP-complete decision problem, and the SAT problem and algorithms for solving it 
are of central importance both in theoretical computer science and in artificial intelligence [27,65,77,79]. In 
a SAT problem, we have c* = 1 if the i-th clause is satisfied; Cj = 0 if it is not satisfied. The SAT utility 
measure is YliL i c i- Consequently, nf=i G = 1 f° r some truth assignment if the formula is satisfiable, else 

XI 

n i=1 G = 0 for all variable assignments. 

The SAT utility measure is not very useful for hill-climbing purposes, since it does not distinguish between 
few and many satisfied clauses. The MAXSAT utility measure is therefore often used, and one considers the 
MAXSAT optimization problem. 

Definition 5 (MAXSAT, MAXSAT utility) Let (j> = (x, q, c) be a SAT formula where C = \c\. The 

MAXSAT utility measure Us of f, given a truth assignment r, is the number of true clauses in <p: Us (r) = 
\{ci E c | Cj = 1 and 1 < i < C}|. The maximum satisfiability problem (MAXSAT) is to find a truth assign- 
ment t* such that the number of true clauses Us (r) in (j) is maximized. 

MAXSAT defines an optimization problem where one optimizes the number of satisfied clauses, X^i=i G- 
Note that i G = C for some truth assignment if the formula is satisfiable, else X^iLi G < C f° r a ll variable 
assignments. If G — C is reached in MAXSAT optimization, one has also solved the underlying SAT 
problem. Since SAT is NP-complete, clearly MAXSAT is NP-complete as well, however MAXSAT is much 
more useful for hill-climbing. A generalization of MAXSAT exists, called weighted MAXSAT, in which each 
clause Ci has a weight Wi, and the task is to optimize the weight over all clauses. Weighted MAXSAT 
optimization and MPE computation are closely related; see Section 7. 
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A key contribution in this article, see Section 3, is how the MAXSAT utility measure can be generalized 
to an additive BN measure which is then utilized in MPE search. In order to set the stage we introduce 
these definitions. 

Definition 6 (SAT-equivalent, SAT-like) A BN (3 is SAT- equivalent to a SAT formula (f> (and vice 
versa) if there is a one-to-one mapping between (3 and if. A BN (3 is SAT-like if there exists a formula <f 
that is SAT-equivalent to (3. 

The constructions underlying Definition 6 are well-known [8, 73, 81] and we do not discuss details in 
this article. Along similar lines, one can define an explanation s in a BN (3 as SAT-equivalent to a truth 
assignment t in a formula <f>, where (3 is SAT-equivalent to if and there is a one-to-one mapping between 
the individual assignment of states to nodes in x and the assignments of truth values to individual variables 
in r. Here, a special case of great interest occurs when an MPE x = x* is SAT-equivalent to a satisfying 
assignment t — t* . 

We now introduce a few additional SAT-related concepts. Below, r maps one variable x to {0, 1} while 
t maps all V variables x in (f> to {0, 1} V . The notation t(x) means that variable ads truth assignment r( x) 
is inverted; r( x) = 1 iff t(x) = 0 and t{x) = 0 iff r(x) = 1. 

Definition 7 In a SAT problem instance <f> = (x, q. c ) with truth assignment r, let x £ x. The notation 
(often abbreviated f) is a new truth assignment with f(x) = r( x) and where for all y £ x — {x}, 
f(y) = r(y). We say that r((f>, x,t) is flipped compared to r. 

SAT is relevant to the study of BNs for several reasons. First, BNs often have many deterministic nodes, 
of which the or- and and-nodes found in SAT are special cases. For examples of application BNs with many 
deterministic nodes, we refer to Figure 3. Second, significant progress on stochastic local search to solve the 
satisfiability problem has been made in recent years [27,35,76,78,79]. Of particular relevance to our work is 
(i) the use of the GSAT measure of gain, which is based on the MAXSAT utility measure, and (ii) the use 
of noise as a mechanism of escape from local optima. In the research reported here, we have extended the 
MAXSAT utility measure and the GSAT measure of gain to the probabilistic setting, giving the additive 
utility measure and the additive measure of gain respectively. 

It can be shown by reduction from SAT that MPE computation is NP-hard [81]. Approximating an 
MPE to within a constant ratio-bound has also been proven to be NP-hard [1]. In fact, the problem is much 
harder, since the evaluation problem is f(P complete [73]. Since inference in BNs is computationally hard 
and the MPE problem is important in applications, we believe that it is important to study SLS algorithms 
for MPE computation, where estimates of x* are computed. 

Definition 8 (MPE (lower-bound) estimate) Let x* be an MPE given evidence e. A best-effoH esti- 
mate of x* is denoted x* ; z/Pr(ai* | e) < Pr (a:* | e) then x* is a lower-bound estimate. 

Generally, lower-bound MPE estimates x* are computed using SLS algorithms. In the remainder of this 
article, our main focus is on SLS algorithms as Las Vegas algorithms [35]. In other words, we typically assume 
that Pr (a:* | e) or Pr(ai*) is a known SLS input parameter, therefore SLS terminates once an x* £ X* has 
been found. Las Vegas algorithms are used in theoretical computer science as well as in SLS research [25,35], 
and enables our scientific study of SLS algorithms as further discussed in Section 5.1. 

SLS algorithms can be analyzed using discrete time Markov chains with discrete state spaces [32,58]. 
Only time-homogenous Markov chains with finite state spaces will be considered in this article. A Markov 
chain is a stochastic process (A t , t > 0) = (A 0 , A\, . . .) induced by a 3-tuple M = ( S , V, V), where S is 
k = |<S| discrete states, a fc-dimensional vector V = (tti, . . ., nj,) represents the initial probability distribution, 
and a k x k matrix V represents the transition probabilities. As we will elaborate in Section 5, Ad is given 
by the objective function, the SLS algorithm, and the SLS algorithm’s parameter settings. 

In A4, some states O C S represent optimal states, and we introduce the following definition. 

Definition 9 (SLS model) Let AA = ( S , V, V) define a Markov chain. Further, assume an objective 
function f : S — * K. and an optimal objective function value f* £ R. that defines optimal states O = {s | s £ S 
and f(s ) = /*}• The 2-tuple (AA, O) defines an SLS model. 

The objective function / and the optimal states O are independent of the SLS algorithm and its pa- 
rameters; finding an s* £ O is the purpose of SLS search including MPE computation. We emphasize 
maximization in the form of MPE computation in this article, therefore / in Definition 9 is given by Pr ( x ) 
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in Definition 1. We often consider BNs with binary nodes. In this case, explanations can be represented as 
bitstrings of length n, b € {0, 1}”, and s* = b* € {0, 1}". 

Consider an SLS model (M,0). A hitting time analysis of M, known from Markov chain theory, gives 
the expected number of search steps needed to reach s* € O. Hitting times are based on first passage times. 

Definition 10 (First passage time) Consider an SLS model (M,0) and let Si € S where S is M’s 
states. The first passage time T, a random variable, into s* £ O is given by T = min(j > 0 : Aj £ O}. The 
expected value ofT, given initial state Aq = Si, is defined as 

mi,o '■= E(T \ A 0 = Sj). 

Definition 10 covers \0\ > 1 and thus includes first passage time into multiple optimal states. For 
simplicity we emphasize the one-state case \0\ = 1 here, with O = {s*} = {sk}- First passage time now 
simplifies to T = min(j > 0 : Aj = s*,}, and simplifies to m^fc. 

Using conditional expectations, one obtains from Definition 10 the following well-known result. 

Theorem 11 (Expected hitting time) Let M be a Markov chain with state space S = (si, . . ., s*,} and 
first passage time T (into s* = Sk). The expected hitting time h is then 

k k 

h ■■= ')T j E(T I A 0 =i) Pr(A 0 = *) = JZ m * 7r *' ( 2 ) 

i—1 i=l 

Expected hitting time can be used to analyze the expected time to reach an optimal state s* € O in 
an SLS model (M, O), and is often closely related to the observed mean run time for an SLS algorithm. 
For SLS, the hitting time h is with respect to some state in O and depends on the algorithm’s input 
parameters including the problem instance. Previously, we have studied hitting time h(pjsr), where pn is the 
noise probability [58]. By varying pjy, we have constructed expected hitting time curves that are analytical 
counterparts to experimental noise response curves [58]. These expected hitting time curves provide an 
analytical foundation for finding optimal noise level p* N . This noise level is optimal in the sense that it 
minimizes expected hitting time, or p* N = argmino< p <i h(p). 


3 Measures of Utility and Gain 

Stochastic local search algorithms are based on evaluating the absolute goodness (utility) of an explanation 
and relative goodness (gain) of an explanation compared to explanations that are neighbors in the search 
space. Gain is, informally speaking, change in utility from one explanation x to another explanation x' . The 
explanation x' may be derived from x , for example by changing (flipping) one node’s state. 

Measures of utility and gain play essential roles in any local search algorithm, including SGS, and this 
section focuses on these matters. We first present measures of utility in Section 3.1, then measures of gain in 
Section 3.2. We highlight two different approaches, the multiplicative approach and the additive approach. 
While the multiplicative approach is natural, the additive approach is related to MAXSAT and turns out to 
give excellent performance in partly deterministic BNs. The basis for how the measures of gain are utilized 
in stochastic local search is described in Section 3.3. 

Readers interested in our portfolio approach but not in our approach to handling zeros in BNs may want 
to skip this section. 

3.1 Measures of Utility 

A utility measure (or function) is also known as a cost function, an energy function, a fitness function, or an 
objective function. We introduce the notion of a utility measure U(x), where x is an explanation in a BN. 

3.1.1 Multiplicative Utility 

As one might expect, the definition of a joint probability from Equation 1 is used as a utility measure in 
SLS. For purposes of notational convenience and uniform treatment in our SLS algorithms we also consider 
it a multiplicative utility Um- 
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Figure 1: The probability of a randomly picked BN explanation x being zero, or y = Pt(Um(x) = 0), as a 
function of varying BN size n (along the axaxis), CPT entry zero probability p, and BN node state space size 
w. The curves are for varying p, with p = 0.1 (curve indicated by blue diamonds), p = 0.5 (curve indicated 
by black solid line), and p = 0.9 (curve indicated by green dashed line). The BN node state space size is 
different in these two panels. Left panel: State space size w = 2; R.igh. panel: State space size w = 8. 


Definition 12 (Multiplicative utility) Let x = ( y , e) be an explanation in a BN with n nodes and 
evidence e. The multiplicative utility is defined as 

n 

U M (x) = Pr(ai) =JJp r (a; i | tt {X f)). (3) 

i= 1 

While multiplicative utility is a natural utility measure, it does have limitations, in particular in so- 
called partly deterministic BNs. These are BNs with many zeros in their CPTs and consequently many 
explanations x with Um(x) = Pr(a:) = 0. 

To obtain further insight into partly deterministic BNs, suppose that zeros are assigned CPT entries in an 
existing BN according to the following randomized algorithm. The algorithm is controlled by a probability 
parameter p, where 0 < p < 1 (see [21] for details). With the exception of the last entry in a CPT column, p is 
the probability that a CPT entry is set to zero. Specifically, p controls the setting of zeros in the CPT entries 
of a BN node in the following manner. We assume a random number generator Random(o, b) that generates a 
real number r, where a < r < 5, uniformly at random. Let V be an arbitrary non-root node in the BN, and let 
y be an arbitrary instantiation of parent nodes Y = II(R). The algorithm sequentially assigns probabilities 
to V’s states, given y , and distinguishes these two cases. Case 1 is for a state Vi £ f ly that is not the last 
state of V: If Random( 0, 1) < p then put Pr(R = V{ Y = y) = 0, else put Pr(R = Vi \ Y = y) > 0. Case 2 
is for the state v\ v \ £ fly that is the last state: Let Pr(V = viy\ \ Y = y) = 1 — X^!=i _1 Pr(^ = w* | Y = y). 
The situation where V is a root node in the BN is handled in a similar manner. 

For the above randomized algorithm, the following theorem holds. 

Theorem 13 Consider a BN f3 with n nodes, with w = I2(Vl) = ••• = fl( V n ), and where CPTs are set 
according to the above randomized algorithm. Let x be an explanation picked uniformly at random in ft. 
Then n 

Pr (U M (x) = 0) = 1 - (l - x p) . (4) 

This theorem is a slight variation on a result in [21]. Figure 1 illustrates (4), and clearly shows the rapid 
speed at which the probability Pt(Um(x) = 0) grows, for an explanation x, with growing BN size n and 
probability parameter p. The slowest growth for Pr (Um(x) = 0) in Figure 1 is for w = 2 and p = 0.1, where 
for n = 10 we have Pt{Um{x) = 0) « 0.4. The fastest growth, on the other hand, takes place for w = 8 and 
p = 0.9, where already for n = 10 we have Pr (Um(x) = 0) ss 1.0. In other words, almost all explanations 
picked uniformly at random have zero probability or Um(x) = 0. Since most application BNs of interest 
contain substantially more than n = 10 nodes, the effect of zeros in CPTs is often much worse than indicated 
by these two examples. 
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As presented in Table 3, a large percentage of zeros can often be found in CPTs of application BNs. 
Theorem 13 suggests that in such BNs, if an explanation x is picked uniformly at random, Um{ x ) = 
Pr(a:) = 0 with high probability. Further, by the SAT reduction it is NP-lrard to find an explanation x such 
that Pr(ai) > 0 [73,81]. The case Um{x) = Pr(x) = 0 therefore presents a serious complication for the SLS 
approach to computing MPEs. With a large part of the search space of explanations equal to zero, there is 
often no gradient to hill-climb for an SLS algorithm using Um- This limitation of the multiplicative measure 
leads us to the additive approach, which we discuss below. 2 3 

3.1.2 Additive Utility 

Due to the limitations of multiplicative utility, we now consider a complementary utility measure, namely 
additive utility U A [55,61]. This utility measure is based on the MAXSAT utility measure, which was used 
in the seminal GSAT family of algorithms [78,79]. 

Definition 14 (Additive utility) Let x = ( y , e) be an explanation in a BN with n nodes and evidence e. 
The additive utility is defined as 

n 

Ua{x) = y^Pr(a?j | (5) 

i — 1 

Additive utility clearly avoids multiplicative utility’s problem with zeros as discussed above in Section 
3.1.1. The detailed investigation and use of the additive utility measure in BN inference is, to our knowledge, 
a novel aspect of our research as reported in this article. 

We provide formal and experimental arguments in support of additive utility and gain in the following. 
First, as shown in Theorem 15 below, additive utility is a generalization of MAXSAT utility in the following 
sense. MAXSAT utility adds the number of satisfied CNF clauses [77,78], given a truth assignment to 
the logic variables. In a SAT-like BN, additive utility adds the probabilities of root and leaf nodes, given 
an instantiation of root nodes (which correspond to SAT variables) and clamping of the leaf nodes (which 
correspond to SAT clauses) to 1. Second, additive utility is an approximation to multiplicative utility. 
(Formal results concerning the relationship between additive gain and multiplicative gain are provided later 
in Section 3.) Third, it turns out that additive utility and gain give excellent empirical results in certain 
BNs as reported in Section 6.2 and Section 6.1. 

Theorem 15 (Generalization of MAXSAT) Let f = (y, q, c) be a SAT problem instance with \y\ = V, 
|c| = C, and n = C + V . Let f3 = (X,W,P) be a Bayesian network that is SAT -equivalent to <j> and with 
leaf nodes clamped, i.e. e = {X± = 1 ,...,Xc = 1}- Let t be a truth assignment to the logic variables 
y = {yi, . . . ,yv} in 4> and let x = {Xc+i = x c + 1 , . . . ,X n = x n } be an equivalent explanation over the V 
root nodes in (3. Then Us (r) = k if and only if 


U A {x) = k + j. (6) 

Proof. Case (<=): Consider a SAT- equivalent BN /3 with an explanation x , and suppose that U A (x) = 
k + Now suppose that (3 has n nodes {Ad, . . . , X n }, and form 

n 

U A (x) = ^2 Pr (^i = x i I TTpQ))- 

i=l 

Since /3 is SAT-equivalent to some formula <j>, ft is bipartite with V root nodes (corresponding to variables 
in f) and C = n—V leaf nodes (corresponding to clauses in </>). We decompose the additive utility U A {x) 
accordingly to 

C n 

U a (x)=J2 PrpQ = ^ U(W)) + Y, Pr (Xi=Xi), (7) 

i=l i=C + 1 

2 In addition, this analysis underlines the importance of related work on handling determinism, for example by means of 
compilation into arithmetic circuits [4] or by encoding BNs into weighted MAXSAT problem instances and then apply weighted 
MAXSAT solvers [67,75]. 

3 Recall that JJ S is MAXSAT utility; see Definition 5 for the formal definition. 



where {Xq+i, • • • , X n } are root nodes and have no parents. By construction, the V = n — C root nodes are 
binary and each root node state has probability mass | [8,73,81]. Since there are V = n — C of them we 
obtain for the second term in (7) 

n y y 

Pr (^i = x i) = Yl Pr (^'+C = X 3+C ) = j- (8) 

i=C+ 1 j = 1 

Clearly, (7) can be written as 

C n 

y^Pr(Xj = Xj I n(Xi)) = Ua{x) - ^ Pr(Xj = Xj) = k, (9) 

i- 1 i-C+1 

where the last equality follows from Ua(x) = k+ j- and (8). Now, in (3, C leaf nodes are being clamped as 
follows: e = {Xi = 1, . . ., Xc = 1}. Without loss of generality we assume a suitable ordering, and from (9) 
we know that for explanation x the leaf nodes in (3 look like this: 

Pr(*! = 1 | 7T Xl ) = 1, • • • , Pr(X fc = 1 | ttxJ = 1, Pr(X fc+1 = 1 | n Xk+1 ) = 0, . . . , Pr(X c = 1 | ir Xc ) = 0; 

Due to the equivalence between explanations in (3 and truth assignments in <j>, there is an assignment t for 
<f> that is SAT-equivalent to explanation x. And from the above it follows that exactly k < C clauses are 
satisfied in <j> under r, or Us(t) = k. Case (=>) is similar, giving the desired result. ■ 

While the MAXSAT measure Us applies to the variables in a SAT formula, the additive measure Ua 
applies to the full explanation in a BN, with BN nodes in a SAT-like BN corresponding to both variables 
and clauses in the SAT-equivalent formula. More importantly, additive utility is not restricted to the logical 
setting as MAXSAT is. The additive measure Ua can therefore be applied in all BNs, not only in SAT-like 
BNs. 

Even though the Ua measure clearly does not compute an explanation’s probability, it is very useful 
for the special but important case of deterministic BN nodes. As already noted, surprisingly many BNs 
from applications have substantial deterministic fragments. Examples of such partly deterministic BNs are 
SAT-like BNs as discussed above (with or- nodes), error correction decoding BNs (with xor-nodes) from 
information theory [22,49], and other BNs from applications — see Table 3. 

3.2 Measures of Gain 

For an explanation x ' , a measure of gain, say A U, simply measures change in utility U compared to another 
explanation x. When a node X is flipped in an explanation x , it has a localized impact in the BN at hand. 
The following definition introduces a vocabulary for such localized changes. To simplify the exposition, we 
often assume = 2 in the remainder of this section; this can easily be generalized to > 2. 

Definition 16 (Flipped node, flipped explanation) Consider an instantiated BN node Xj = Xj. If one 

puts Xj = Xj, where Xj = 0 if Xj = 1 and Xj = 1 if Xj = 0, then the node is flipped (from Xj = Xj ). An 
explanation x' is flipped from another explanation x if at least one node Xj has Xj = Xj in x while Xj = Xj 
in x' . 

We note that the terminology “flipping” does not imply that CPTs are being changed — we are not 
performing machine learning but rather search. 

In the following definition, the notation x [Xj = Xj] means that the current state of the BN node Xj is Xj 
in the explanation x; no changes are made. The notation x' = x[Xj <— Xj], on the other hand, means that 
Xj (which was assigned state Xj in x) is now, after the assignment Xj <— Xj, assigned state Xj in x' . This is 
similar to assignment in programming languages. 

Definition 17 (One-flipped) An explanation x' = {X m+ i = x m +i,...,Xj = Xj, ...,X n = x n } is one- 
flipped from an explanation x = {X. m+ i = x m +i, . . . , Xj = Xj, . . . , X n = x n } if exactly one node, here the i-th 
node, is flipped. For one-flipped explanations the notation x' = x[Xj <— xy], where x' is the new explanation, 
is also used. A one-flipped parent instantiation, denoted tt x , means that some node Y £ II.y has its state 
flipped from y to y in X ’s parent instantiation n x . 

At the core of SLS algorithms is an evaluation of whether the state of a node Xj should be flipped from 
Xj to Xj, leading to a corresponding one-flipped change in explanation from x [Xj = Xj] to x[Xj <— Xj]. These 
evaluations are based on the concept of gain. We now discuss multiplicative gain, which is derived from 
multiplicative utility. 
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3.2.1 Multiplicative Gain 


The multiplicative gain is obtained as follows. Let x be the explanation before flipping, x' the explanation 
after flipping some number of nodes. A measure of the goodness of those flips is introduced in the following 
definition. 


Definition 18 (Multiplicative gain, one-flip multiplicative gain) Let x and x' be explanations, as- 
sume that Um(x) > 0, and suppose that we flip from x to x' . The multiplicative gain AUm{x, x') is defined 
as 


AU m (x, x') 


Um{x') 
U M (x) ' 


(10) 


Suppose that € f 1xi> Xij € and *[Xj = 
Xij, where j ^ k, is defined as 


AU M (x, Xj j ) 


Xj : k\. The general case of one-flipping x ’s X, ; from Xi ^ to 


Um(x[Xi < X^j]) 

Um (x[X{ x^^,]) 


Definition 19 (Binary one-flip multiplicative gain) Assume that X 4 is binary, x[X t = Xj], and let 
x' = x[Xi <— Xj]. Binary one- flip multiplicative gain is then defined as 


A U M {x,Xi) 


Um(x') 

U M (x) 


Um (x \X-i < X^]) 

Um (x\Xi — Xi ]) 


(12) 


We now show how multiplicative gain computation can be simplified compared to the above definitions. 


Theorem 20 (One-flip multiplicative gain) Let x and x' be explanations such that x' = x[Xi <— Xj]. 
Further, let C = H/.Yij let Cj € C, and let ’xiCj) C x (respectively Tt(Cj) C x') be the instantiation of all 
parent nodes for children of Xi before (respectively after) a one-flip of Xi. Suppose that Um{x) > 0. The 
one-flip multiplicative gain AUm(x,Xi) is 


Pr (zi I Tr(-X'i)) x Tlc.-ec Pr ( c i = c i I * {Cj)) 

M{X,Xi ’ ~ P*(xi I tt (Xj)) x I I, • . (• I’rif ) c, | 7 r(Cj)y 

with Pr(x; | 7T (Xi)) x I! c,ec Pr ( C 'i = c i I ^(Cj) > °- 

Proof. We start with the definitions of multiplicative gain and utility, obtaining 

Arr , U M (x[Xi <- X,]) Pr(x[X 4 <- x,]) 

%i) — TT ( \ — T 3 / \ 

U M {x) Pr(®) 


(13) 


(14) 


We now consider Xj’s children C = T Xi ■ Let Xbe the BN nodes other than Xi and C: Y = V— ({Xj} U C), 
and rewrite Equation 14 as 

_ Pr (^i = ^ | 7r(X,;)) x Ilc.ec Pr ( C i = c i I *(Cj)) x Uygy Pr ( y = V I 7r ( y )) 

M X ' 3 ' Pr (^i = Xi | 7r(X,;)) x Ilc.ec Pl ( C j = c j I n( c j)) x Uy&y Pr ( y = V I A Y )) ’ 

Since only one node X, is flipped (from x* to xi), we can exploit the Markov blanket locality of the BN and 
only need to consider the CPT in X, itself as well as CPTs for children C, and obtain 

, _ Pr P^ = Xi | TT Xi ) x ricjGC Pt ( C J = C 3 I *( C j)) 

M{X,Xi) Pr(Xj = Xi I wxi) X n Cj GC p r(Q = cj | 7 flCfl ) ' 

This is exactly the multiplicative gain AUm{x, x) in Equation 13. Obviously, since by assumption Um(x) > 0 
it is clear that Pr(xi | 7r(Xi)) x Y\c eC Pr (Cj = c j I n (Cj)) > 0 as well. ■ 

Note that the computation of Pr(x') of a flipped explanation can be expressed as taking the probability 
of the old explanation, Pr(x), and multiplying it with the gain from Equation 10, giving 


Pr(x') = A Um(x,x')~Pi(x). 


(15) 
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We typically restrict ourselves to the one-flipped case in this article, and consequently have x' = x[Xi 4— ac»] 
and thus obtain 


Pr(a/) = Pr(a 4 - Xi\) 

= AU m (x, Xi) Pr(x) 

_ Pr (^i =Xi\n Xz )x ricjec Pr (^j' = c i 
Pr(Xi = Xi | 7Tx,) x ric 3 ec Pr ( C ? = O' 


*(Gf)) 

ACj)) 


Pr(a;). 


(16) 


Especially when the number of nodes n is large, using (16) iteratively rather than evaluating all of x' to 
compute Pr(a;') provides substantial computational savings. 

In the use of multiplicative gain in SGS and other SLS algorithms there is a slight but crucial complication 
in that multiplicative gain is not defined, according to Definition 18, if the denominator Um(x) = 0. In 
practice, Um{x) = 0 is very common in partly deterministic BNs. We discuss how this can be handled in an 
ad-hoc manner in SGS in Section 4.1. However, this limitation of Um{x) also motivates our investigation 
of additive gain which we turn to now. 


3.2.2 Additive Gain 

The assumption Um{x) = Pr(a:) > 0 made for multiplicative gain in Equation 10 is sometimes not realistic 
and leads us to introduce additive gain based upon additive utility. 

Definition 21 (Additive gain, one-flip additive gain) Let x and x' be explanations. The additive gain 
A Ua{x,x') is defined as 

A U A (x, x') = Ua(x’) - U A (x). (17) 

Suppose that Xi^ £ Q Xi , %i,j G LI X( > an d x[X^ = Xi t k\. The general case of one-flipping x ’s X t from a ^ to 
jjA where j k, is defined as 


AU A (x,Xij) — U A (x[Xj / < Xi.j \ ) U A (.x[Xi — ^-i,fc]). (18) 

Definition 22 (Binary one- flip additive gain) Assume that Xi is binary , x[Xi = x.f\, and let x' = 

x[Xi <— xf\. Binary one- flip additive gain is then defined as 

A U A (x z Xi) = U A (x') - U A (x) = U A {x[Xi <- 5,;]) - U A (x[Xi = a ;»]). 

The additive gain is well-defined in cases where the multiplicative gain is not. As a consequence, we can 
in the SGS implementation of additive gain computation (see Figure 4) handle the case Pr(a:) = 0 without 
complicating the stochastic local search algorithm (see Figure 3) or changing the CPT entries (see [67]) as 
has been done when using multiplicative gain. As we will see in experiments, additive gain proves useful as 
a complement to multiplicative gain. 

Theorem 23 (One-flip additive gain) Let x and x' be explanations such that x' = x [A,; 4 — xf . Further, 
let C = 'S’ Xi , letCj £ C, and letir(Cj) C x (respectively it (Cj) C x’) be the instantiation of all parent nodes 
for children of Xi before (respectively after) a one-flip of Xi. The one-flip additive gain AU A {x,Xi) is then 

AU A (x,Xi) = Pr(i, ; | tt.yJ + ^ Pr (Cj = cj \ i f(Cj-)) - Pr(a;, | n Xi ) - ^ Pr (Cj = Cj \ tt (Cj)). (19) 

Gj&C CjGC 


Proof. The one-flip additive gain can be derived in a manner similar to the proof of one-flip multiplicative 
gain AUm{x,x) in Theorem 20. ■ 

3.2.3 Additive Gain and Multiplicative Gain 

We have several arguments for why the use of additive gain may be productive. One reason, which is stated 
formally in Theorem 27 below, relates additive gain and multiplicative gain. In order to improve readability, 
we introduce in Definition 24 and Definition 25 alternative ways of expressing additive and multiplicative 
gains when flipping a node X, with children C. 
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Definition 24 (Simple additive gain notation) Let a = |C| + 1 and put in Theorem 23: pi = Pr(:Efc | 

*x k ), E ?= 2 Pi = E Cj ec Pr ( C i = c i I Qi = Pr (®fc I *x h ), and Ef = 2 ft = Ec j ec Pr ( C l = Cj I 

7r(Cj)). This gives f/ie following simplified notation for additive gain: 

CX. Ot 

Au A (x,x k ) ~ 

i=l i=l 

Definition 25 (Simple multiplicative gain notation) Let |C| = a + 1 and put in Theorem 20: p\ = 
Pr(Sfc | tt.yJ, nr= 2 Pi = n Cj£ c Pr (^ = c o I Qi = Pr ( x k \ -Kx k ), and II = ric, 6 c Pr ( C i = c i I 

tt(Cj)). This gives the following simplified notation for multiplicative gain: 

CX 

AU M (x,x k ) = l[ (20) 

i=i 9i 

again under the assumption n; Li Qi > 0. 

The following fact follows easily from the definitions of A U A and AUm- 


Proposition 26 Let x and x' be explanations and suppose that A Um{x) > 0. A U A (x, x') = 0 if and only 

if AU m (x, x') = 1. 


We now turn to the more general case where AU a (x, x') ^ 0 and A Um(x, x') ^ 1. 

Theorem 27 Consider the definitions of additive gain AU a (x, x k ) = Ef=i Pi ~ Ef=i Qi ant d multiplicative 
gain AUm(x, x k ) = J|“_i from Definition 2f and Definition 25 respectively. Suppose that there exists 
a permutation a of {gi, . . . , q a }, denoted {Q a m, ■ ■ ■ , Qa(a)}> and a permidation p of {p±, . . . ,p a }, denoted 
{p P ( l), • • • ,Pp(a)}> such, that 

Pp( 1) A Qcr(l)i • • • )Pp(k) A Qct(k)iPp(k,+ 1) Qc r(«+lp • * • iPp(a) Qcr(a) (^1) 

for some 1 < k < a. If (21) is true and A U A (x,x k ) > 0 then AUM{x,x k ) > 1. 


Proof. Since, by assumption, AU a (x,x') = J2i=iPi — E 
1 < j < k, it is the case that p p (j\ > Q a (j ) or i n other words 


(X 





qi > 0, we must have n > 1 in (21). For any 
- > 1, giving also 


n 


3=1 


PpU) 

Qcr(j) 


> 1. 


(22) 


For the remaining elements in (21) we have p p ( K +\) = 9<t(k+i)> ■ ■ ■ ,P P ( a ) = Qa(a) and thus 


TT PpU) _ 1 

,i K +i Mo) 


(23) 


Going back to the definition (20) as well as (22) and (23) gives 


a 

AU M (x,x k ) = P — 

i Qi 


n— n 

i=i Mo) ,-iii 


PpU) 

Q&U) 


> 1 


as desired. ■ 

Theorem 27 justifies the use of A U A in MPE local search algorithms as follows: When flipping from 
explanation x to explanation x' — and if AU a (x,x') > 0 as well as condition (21) holds during local search 
— one is also making progress according to AUm- 

Despite the result above, we emphasize that a positive additive gain A U A does not always imply an 
increase in an explanation’s probability or multiplicative gain AUm- This is stated formally in the following 
theorem. 


Proposition 28 Let x be an explanation, and x’ be one-flipped from x. It is not always the case that if 
AU a (x, x') > 0 then A Um(x, x') > 1. 

Proof. A counter-example in a BN with two nodes X\ and X 2 is provided. The nodes both have 
state space {0,1}. Node X\ is A 2 ’s parent. Let the CPTs be {Pr(X| = 0) = 0.2, Pr(X-| = 1) = 0.8, 
Pr(X 2 = 0 | X 1 = 0) = 0.4, Pr(X 2 = 1 | X x = 0) = 0.6, Pr(A 2 = 0 | X x = 1) = 0.05, Pr(X 2 = 1 j 
X\ = 1) = 0.95}. Now consider x={Xi = 0,A 2 = 0} and x' = {X^ = 1,A 2 = 0}. In this case we have 
A U A (x,x') = U A (x')-U A (x) = (0.8 + 0.05) -(0.2 + 0.4) = 0.25 > 0 while A U M (x,x') = U M (x')/U M (x) = 
(0.8 x 0.05)/(0.2 x 0.4) = 0.5 ^ 1. ■ 
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3.2.4 Additive Gain and GSAT Gain 


The reasoning behind the GSAT gain [78,79] and the additive gain shown in Equation 19 is similar. Reflecting 
this, we introduce the following two definitions based on previous research [78,79]. 

Definition 29 (GSAT gain) Let r and r' be truth assignments. Then GSAT gain A Us (t,t') is defined 
as 

A U s (r, t') = A U s (t') - A U s (r) . 

The notation A Us{t,t') is used for GSAT gain over truth assignments r and r' for SAT formulas, 
similar to additive gain AUa(x, x') for explanations x and x' in BNs as introduced in Definition 21. 

Definition 30 (One-flip GSAT gain) Let t be a truth assignment to the variables y in a SAT instance 
4> = ( y , q, c). Suppose that t is changed to t by flipping one variable y £ y from r{y) to f(y). Let 
MakeCount(t, y) be the number of clauses satisfied (or made 1) by the flip ofy. Let BreakCount(t, y) 
be the number of clauses unsatisfied (or made 0) by the flip of y. One-flip GSAT gain is defined as 

A Us (r, y) = MakeCount (t, y) - BreakCount (t, y) . 

In a manner similar to GSAT gain’s success in SAT, our additive gain turns out to be powerful on 
deterministic BN nodes, which occur in application BNs, as well as in SAT-like BNs which encode satisfiability 
instances [8,73]. In fact, GSAT gain is a special case of additive gain as expressed in the following result. 

Theorem 31 Let BN (3 = (Y, W, P) and suppose that SAT instance (f> = [y, q, c ) is SAT -equivalent to 
(3. Further, let two explanations Xi and x 2 in (3 be SAT-equivalent to two truth assignments r i and t 2 , 
respectively, for </>. Then AUa(x \ , x 2 ) = A Us (ti,t 2 ). 

Proof. Applying Theorem 15 to both x\ and x 2 gives 


Ua{x i) = Us(t i) + j 

(24) 

U a (x 2 ) = U s (t 2 ) + j, 

(25) 

and when (24) is subtracted from (25) we obtain 

Ua(x 2 ) - U A (x i) = U s {t 2 ) - U s {t i). 

(26) 


Simply substituting the definitions of AUa(xi, x 2 ) (Definition 21) and AUs {ti,t 2 ) (Definition 29) into 
(26) gives the desired result, m 


The following result follows easily from Theorem 31. 

Corollary 32 Let AU A(xi,Xi) be the additive gain obtained by flipping a root node Xi £ X in a SAT-like 
BN (3 = ( Y , W, P). Assume that the SAT instance <p = (y, q, c ) is SAT-equivalent to (3 and that T\ is 
equivalent to x\. Further, let AUs (r\,yi) be the GSAT gain obtained, under Ti, by flipping the variable 
yi £ y corresponding to the root node £ X. It is then the case that AUa(x\, xf) = A Us (t\ ,yf). 

Proof. Put x 2 <— x-[ [Xi <— Xi] and let t 2 <— r(</>, yi, Ti) (copying T\ into r 2 but exchanging r^yf) with 
t(j/i)). Observe that node Xi corresponds to the variable y, . Clearly, the constructed truth assignment r 2 
is equivalent to the constructed explanation x 2 , thus Theorem 31 applies and the result follows. ■ 

3.3 Criteria of Choice 

In addition to measures of utility and gain, SLS algorithms including SGS need criteria of choice. These 
criteria determine how an SLS algorithm decides, based on its gain computations, which BN node(s) to flip 
next. Gains in a neighborhood around the current explanation x are computed and placed in a candidate 
array (or set), defined as follows. 

Definition 33 (Candidate array) Let x be an explanation and let gij = A U(x, Xij) be the one-flip gain 
obtained by flipping explanation x ’s i-th node Xi from its k-th state k £ {1, . . . , | | } to its j-th state, where 
j £ {1, . . . , | fix, 1} — {£;}. The set of 3-tuples 

A = {(Xi,Xij, gij) | Xi is the i-th BN node, Xij is Xi ’s j-th state, and gij is the one-flip gain} 
is denoted the candidate array. 
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At present, gij = AU(x, Xij) is either multiplicative gain A U m(x, Xij) ( 11 ) or additive gain At /4 (a:, %i,j) 
(18). The candidate array describes explanation ars local neighborhood and is used to determine — using a 
criterion of choice — which node to actually flip in x. We have investigated two types of criteria of choice, 
the greedy criterion C'c and the noisy criterion Cn- Both criteria rely on the candidate array A and a tuple 
of gains in A, formally G = (gi, . . . , g m )- 4 

Definition 34 (Greedy criterion Cq) The greedy (or maximizing) criterion Cg{A) is defined as picking 
a tuple with gain g m ax such that g max > < 7 j for all 1 < j < m. Among k such maximal tuples, Cg(A) picks 
the i-th tuple with probability pi = 1 jk for 1 < i < k; all remaining tuples are picked with probability Pi = 0 
for k + 1 < i < m. 

Definition 35 (Noisy criterion Cn) The noisy (or stochastic) criterion Cn{A) is defined as picking the 
i-th tuple with gain gt, for 1 < i < m, according to the probability distribution 

/ m 

■ 

3=1 

E m 

Pi = 1 in Definition 34 and in Definition 35, thus these are both valid 

probability distributions. 

In SGS, the maximizing criterion in Definition 34 chooses greedily from the candidate array A and 
yields classical steepest-ascent hill-climbing. The stochastic criterion in Definition 35 yields probabilistic 
hill-climbing, since it probabilistically decides which node and state to flip to, using gain information in 
the candidate array A. This stochastic criterion is closely related to the approach used in the stochastic 
simulation algorithm [72] as well as to noise strategies such as “random walk” used in stochastic local search 
for SAT [78], 

A measure of gain, then, defines how to measure progress for candidate local search steps. A criterion 
of choice, on the other hand, determines how to choose between the different candidate local search steps, 
based on the candidate gains, using a maximizing criterion or a stochastic criterion. 


4 Stochastic Local Search Algorithms 

Stochastic greedy search (SGS) is a local search approach augmented with stochastic initialization and noise 
steps. This section presents two variants of SGS, in order of increasing complexity and power. Section 4.1 
presents simple SGS (or SimpleSGS), while Section 4.2 presents operator-based SGS (or OperatorSGS). 
Among these, SimpleSGS is most similar to other SLS algorithms, and acts as a stepping stone to reach 
OperatorSGS, our portfolio-based main contribution. 

4.1 Simple Stochastic Greedy Search 

SimpleSGS, which is presented in Figure 2, starts from an explanation x as constructed by an initialization 
algorithm Initialize and performs one- flip local changes to x in order to improve the utility U(x). The 
input parameter U controls which measure of utility and gain is used, and is currently either Um or U a ■ 
The algorithm is related to the GSAT and WALKS AT algorithms [78,79] as well as other SLS algorithms as 
discussed in Section 7: There is an outer loop for tries (or restarts), and an inner loop for flips (or stochastic 
local search steps). The parameter MAX-FLIPS limits how many flips are done before a restart, while the 
number of tries is upper bounded by MAX-TRIES . 5 The parameter pn controls the noise level. Applying 
noise amounts to taking a random step with probability pn, and taking a greedy step using the chosen 
measure of gain with probability 1— pn ■ This is similar to WalkSAT with random noise [78]. SimpleSGS 
terminates when an explanation x* is found such that U(x*) > U m i n . Other termination criteria can easily be 
incorporated into SimpleSGS. In particular, we have implemented the option of terminating after executing 
SimpleSGS for a certain time period, even though this is not shown in Figure 2. 

The SimpleSGS algorithm returns a two-valued tuple (fe, x*). The first value b is a Boolean value 
signifying whether search was successful or not; the second value x* is the explanation with the highest 

4 For the binary case and when there is no evidence, m = n since an explanation in a BN with n nodes has n possible one-flip 
gains. 

’An alternative to the use of MAX-TRIES, not further pursued in this article, would be to use an upper bound on computation 
time in wall-clock time. Further discussion of termination criteria can be found in Section 5.1. 
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SimpleSGS(/3, e, MAX-TRIES, MAX-FLIPS, p N , U, C, U min ) 

Input: /? Bayesian network 

e evidence 

MAX-TRIES maximum number of restarts 
MAX-FLIPS maximum number of flips 

Pn noise probability 

U measure of utility (and gain): Um (and A Um) or Ua (and A XJa) 

C criterion of choice: Cq or C/v 

C/ m in utility lower bound for termination 

Output: ( b , x*) b £ {true, false}; x* is a best effort estimate of MPE x* 

begin 

x* <— INITIALIZE (/?, e) {initialization uniformly at random (similar to NU operator; Section 4.2)} 
for i <— 1 to MAX-TRIES {loop of tries} 

x <— Initialize}/?, e) {initialization uniformly at random (similar to NU operator; Section 4.2)} 
if (U(x) >U(x*)) then x* <— x 
if (U{x*) ^ Umin) then return (true, x*) 
for j <— 1 to MAX-FLIPS {loop of flips} 

Noise <— (Randomize}) < p jv) {decide between hill-climbing and noise, set Noise flag)} 
if (not Noise) then {Noise = false, perform greedy hill-climbing step} 

if (U = Um) then A <— ComputeMultiplicativeGain(/3, e, x) {put Um gains in array A} 
else A <— Compute AdditiveGain(/ 3, e, x ) {put Ua gains in array A} 

if (A = 0 ) then 

if (pn = 0) then break 

else Noise <— true {no candidates, apply noise} 

else 

( Xi,Xij ) <— ChooseStAte(A, C) {pick node A, ; and state Xij from array A} 

endif 

endif 

if (Noise) then {Noise = true, perform noise step} 

Xi <— pick node X t from /? at random 

x i,j pick state Xjj G Ox, — { x i,k} at random {avoid the current state Xi^} 

endif 

x <— x[Xj <— Xij] {set Xi = Xij in explanation x} 
if (U(x) > U(x*) then x* <— x 
if (U(x*) > Umin) then return (true, x*) 
endfor {loop of flips} 
endfor {loop of tries} 

if (U m i n = 0) then return (true, x*) else return (false, x*) 

end 

Figure 2: The simple stochastic greedy search algorithm SimpleSGS. The algorithm repeatedly interleaves 
noise and greedy hill-climbing steps; hill-climbing is done by calling either ComputeMultiplicativeGain 
or ComputeAdditiveGain. These functions return a candidate array A which contains candidates for the 
next hill-climbing step. A candidate is a tuple (Xi, Xij, gij), where Xi is the BN node potentially being 
flipped, Xij is a state of that node, and <ji.j is the gain obtained by flipping Xi to a: The ChooseStATE 
function applies the criterion of choice C to pick a node A,; and state Xij, which is then subsequently flipped 
to. 
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ComputeMultiplicativeGain(/ 3, e, x) 

Input: /3 Bayesian network 

e evidence 
x explanation 

Output: A candidate array of gains computed using multiplicative gain 

begin 

X <— nodes in Y, where /3 = ( Y, W, P), that are not in e 
Init(A, 1) {for multiplicative gain initialize candidate array A with Is} 
for each non-evidence node Xi G X 
o <— Xi {remember original state a ;,;} 

Old <— NODECHILDRENPROBABILITY(/3,Xj, X,Um) 
for each state s G — {xi} {try all states except state Xi} 
x 4— x[Xi <— s] {try state s in *-th node Xi} 

New <— NodeChildrenProbability(/3, Xi, x,Um) 
if (Old = 0) then 

if (New = 0) then g <— 1 
else 

Force(A, Xi, s) {“force” state s into candidate array A} 

return A 
endif 
endif 

g <— New / Old 

Add(A, Xi, s,g) {add tuple (. Xi,s,g ) to candidate array A} 

endfor 

x <— x[Xi <— o] {reset state of X \ to original value o} 

endfor 
return A 

end 

Figure 3: This algorithm, which implements the multiplicative measure of gain A Um, is called by SGS and 
is used to fill in the candidate array A with gains for neighbors of the current explanation x. 

ComputeAdditiveGain(/3, e, x) 

Input: /3 Bayesian network 

e evidence 
x explanation 

Output: A candidate array of gains computed using additive gain 

begin 

X <— nodes in Y, where (3 = ( Y, W, P), that are not in e 
Init(A, 0) {for additive gain initialize candidate array with Os} 
for each non-evidence node X t G X 

o < — Xi {remember original state Xi of node X t } 

Old <— NODECfflLDRENPROBABILITY(/3,Xj, X, U a) 
for each state s G — {^i} {try all states except state Xi} 
x 4 — x[Xi <— s] {try state s in *-th node Xi} 

New 4 — NodeChildrenProbability(/3, X t , x,U a ) 
g 4 — New - Old 

Add(A, Xi, s, g) {add tuple ( Xi,s,g ) to candidate array A} 

endfor 

x 4 — x[Xi 4 — o] {reset state of node X t to original value o} 

endfor 
return A 

end 

Figure 4: This algorithm, which implements the additive measure of gain A Ua, is used by SGS to fill in the 
candidate array A with gains for neighbors of the current explanation x. 
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utility score. The input value pn = 0 means that no noise should be applied; in this case SimpleSGS 
returns when reaching a local maximum. With U m i n > 0, SimpleSGS is successful if utility U m or 
better is attained. SimpleSGS uses gain computation algorithms ComputeMultiplicativeGain and 
ComputeAdditiveGain, see Figure 3 and Figure 4 respectively, to evaluate candidate tuples and construct 
the candidate array A. Which gain computation is used depends on the value of the input parameter U. Both 
gain computation algorithms iterate over all non-evidence nodes and over all states of each node X,; . excluding 
a node’s current state Within the inner-most loops there are calls to NodeChildrenProbability, 
which are used to compute the gain g associated with flipping node Xfs state in explanation x. 

We now discuss how the ComputeMultiplicativeGain and ComputeAdditiveGain algorithms im- 
plement multiplicative gain A Um and additive gain AUa respectively. In the implementation of the mul- 
tiplicative measure as shown in ComputeMultiplicativeGain in Figure 3, the assignment to Old is the 
denominator in multiplicative gain, while the assignment to New is multiplicative gain’s numerator. Recall 
from Definition 18 and Theorem 20 that multiplicative gain is not defined if the denominator is zero. Conse- 
quently, there is a test for zero in ComputeMultiplicativeGain, and the following action is taken: If Old 
(denominator in Equation 12) and New (numerator in Equation 12) are both zero, we define multiplicative 
gain g to be one: g <— 1. If Old is zero but New is non-zero, we greedily pick the new state s in the statement 
FORCE (A, A,;, s), and then return. So in this case where Old (denominator) is zero but New (numerator) 
is non-zero, the algorithm is more greedy than in the remaining cases, since the first state in the first node 
where a flip leads from a zero to a non-zero probability is picked as a candidate for flipping, by using FORCE. 

Both ComputeMultiplicativeGain and ComputeAdditiveGain return a filled-in candidate array 
A to SimpleSGS. After computation of A, SimpleSGS calls the ChooseState function. ChooseState 
applies the criterion of choice and picks which state Xjj to flip to. Then, after a possible noise step, the 
utility measure U(x) is applied. Using either multiplicative utility Um or additive utility C/ 4 , the better 
explanation of x and x* is kept. The algorithm then terminates or iterates. 

SimpleSGS has been very successful in certain BNs, as is demonstrated in Section 6.1. It turned out that 
the algorithm also has some limitations, including the following. First, SimpleSGS does not accommodate 
an arbitrary number of different initialization and search operators in the same invocation of the algorithm. 
Second, the measures of utility and gain are coupled through the input parameter U, and it is not possible 
to apply different measures of gain in the same invocation of the algorithm. Addressing these limitations 
is crucial to obtaining strong performance. We now discuss how they are addressed in the operator-based 
variant of SGS. 

4.2 Operator-Based Stochastic Greedy Search 

The operator-based variant of SGS, denoted OperatorSGS, is presented in Figure 5. OperatorSGS re- 
tains the structure of SimpleSGS, with two nested loops for tries and flips. What is novel in OperatorSGS 
is how the algorithm chooses from two sets of operators or heuristics when an initialization or search step is 
performed. This operator-based approach gives better flexibility and extensibility, resulting in substantially 
better performance on a range of different BNs compared to SimpleSGS. We will return a more detailed 
discussion of OperatorSGS after formally introducing its initialization and search operators. 

Definition 36 (Initialization operator, initialization operators) An initialization operator is a heuris- 
tic that computes, given a BN fd and evidence e, an explanation x. A set of initialization operators is denoted 
$/• 


The initialization operators that have so far been explored using OperatorSGS are: uniform initial- 
ization (UN), forward simulation (FS), backward dynamic programming (BDP), and forward dynamic 
programming (FDP); we thus have <&/ = {UN,FS,BDP,FDP} [63]. Other algorithms, for example the 
mini-bucket algorithm [41], can easily be incorporated as initialization operators in OperatorSGS. 

Definition 37 (Search operator, search operators) A (local) search operator is a heuristic that com- 
putes, for a BN (3 with evidence e and an explanation x, a new explanation x' . A one-flip (local) search 
operator is a search operator where x' = x[Xi <— xf\. A set of search operators is denoted 4 * 5 . 

Our search operators are closely related to flip selection strategies such as GSAT [79], WSAT-G [54], 
WSAT-B [54], WSAT [54,78], Novelty [54], Novelty+, and DLM [76] in the SAT setting, and similar strategies 
in the BN setting [41,48,61,68,69]. Traditionally, flip selection strategies are hard-coded using an if- 
then-else statement containing a greedy part and a noisy part similar to what is done in SimpleSGS. 
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In OperatorSGS, on the other hand, these two parts are typically represented by means of two distinct 
operators in a search portfolio. More importantly, OperatorSGS is not restricted to two operators. We 
will return to the issue of flip selection strategies and search operators after introducing the following type 
of search operator. 

Definition 38 (Compound operator) A compound operator is a 2-tuple (AU(x, Xij), C{A)), where 
A U{x, Xij) is a gain function and C(A) is a criterion of choice operating on a candidate array A. 

Example gain functions referred to in the above definition are additive gain A Ua{x, x^j) and multiplica- 
tive gain AUm(x, Xij). Compound operators are used by CompoundSearch, see Figure 6. 

The operators {MG, AG, MN, AN} are compound operators that we have used in OperatorSGS so far; 
these operators are defined as follows. 

Definition 39 (Compound operators MG, AG, MN, and AN) Using C(A) = Cg{A) (the greedy 
criterion) we define these greedy search operators : G 

MG := (AUm(x, Xij), Cq{A)) {multiplicative gain with greedy criterion} 

AG := (A Ua{x, Xij), Cg(A)) {additive gain with greedy criterion}. 

Using C(A) = Cn(A) (the noisy criterion) we define these noisy search operators: 

MN := (At/ m (x, Xij), Cjv(A)) {multiplicative gain with noisy criterion} 

AN := (AU a (x, Xij), Cjv(A)) {additive gain with noisy criterion}. 

The AG operator is formally justified by Theorem 27 and Theorem 32. Informally, there is the following 
relationship between maximizing GSAT gain and maximizing additive gain: Using GSAT gain, one flips the 
logic variable which gives the highest increase in the number of satisfied clauses [79]. Using the additive 
operator AG, we flip the BN node which gives the greatest additive increase in values of the affected CPTs. 

In addition to the compound operators introduced in Definition 39, there are operators that are not 
formed by combining a gain function and a criterion of choice. For example, we have investigated the 
uniform noise (NU) and stochastic simulation (SS) operators. Uniform noise means that a non-evidence 
node, picked uniformly at random, is flipped with probability pn- Stochastic simulation is Gibbs sam- 
pling in BNs, where nodes are sequentially sampled [72]. Altogether, this gives search operators $5 = 
{NU,SS,MG,AG,MN,AN}. 

An operator portfolio A, formally introduced in Definition 40 below, is generally speaking a set of oper- 
ators along with an approach to operator selection. We focus here on selection according to a probability 
distribution defined by associating a probability with each operator as follows.' 

Definition 40 (Stochastic portfolio) Let q> 1 and let <f> = . . . , <f > q } be a set of operators. A stochas- 

tic portfolio over $ is a set of q operator tuples A = {vi, . . . , v q } = {((f> 1 ,pi), . . . , ( <f> q ,p q )} where 0 < Pi < 1, 
Yli=iPi = 1> an d (</> i,Pi ) means that the i-th operator (f>i, where 1 < i < q, is selected (and then executed) 
with probability pi whenever some operator is selected from A. 

Initialization algorithms have turned out to be important components in SGS as well as in other sim- 
ilar algorithms [41,48,63,69]. In OperatorSGS, the input parameters therefore include a portfolio of 
initialization operators A/, defined as follows. 

Definition 41 (Initialization portfolio) Let <!>/ be a set of initialization operators. The OperatorSGS 
initialization portfolio A / is a stochastic portfolio over <f>/ according to Definition )0. 

The initialization algorithms construct explanations which make up starting points that the local search 
steps incrementally improve, leading to increasingly better MPE estimates. We have investigated the follow- 
ing initialization operators for stochastic generation of initial explanations: uniform initialization (UN), 
forward simulation (FS), forward dynamic programming (FDP), and backward dynamic programming 
(BDP) [63]. These stochastic initialization operators all have a time complexity of 0(n) in the number 
of nodes. A wide range of initialization heuristics can easily be incorporated as initialization operators in 
OperatorSGS. 

The portfolio A 5 for the local search heuristics is formally introduced as follows. 

6 In previous work [61,63] MG was called BM, AG was called GM, MN was called BS, AN was called GS. 

'An alternative to the stochastic portfolio is a round-robin portfolio or schedule. A round-robin portfolio is a sequence of q 
operators 4 = (cf> 0 , . . . , <j> q _ 1 ) such that the j - th time an operator is selected for execution from 4, operator j, where i = j modr 
is selected. We do not further investigate round-robin portfolios in this article. 
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Greedy: p G = Pag + Pmg 

Noisy: p N = pan + Pmn 

Additive: 

PA = PAG + PAN 

Additive & greedy: (AG, Pag) 

Additive & noisy: (AN, Pan) 

Multiplicative: 

Pm = Pmg + Pmn 

Multiplicative & greedy: (MG, pmg) 

Multiplicative & noisy: (MN, pmn) 


Table 1: Joint probability table decomposed according to the two orthogonal dimensions of (i) greedy versus 
noisy search and (ii) additive versus multiplicative gain computation. Taken together, these two dimensions 
define four operators or two-tuples of search operators (or heuristics) and probabilities as illustrated. 


Definition 42 (Search portfolio) Let $g be a set of search operators. The OperatorSGS search port- 
folio A s is a stochastic portfolio over <f>g according to Definition fO. 

The input parameters of OperatorSGS, see Figure 5, largely overlap those of SimpleSGS but include 
a portfolio of search operators A g and a portfolio of initialization operators A/. OperatorSGS chooses only 
among the operators included in the portfolios A j and A g. More specifically, Initialize^, e, A/) returns 
a new explanation x for /?, created using the initialization operators in the initialization portfolio A / (see 
Definition 41). SeARCH(/ 3, e, x, A 5) performs a search step on the explanation x, using the measure of gain 
associated with the search operator chosen from the search portfolio Ag. A search operator is picked from 
the search portfolio A g as described in Definition 42. Compound operators are invoked by Search using 
CompoundSearch, see Figure 6. 

This operator-based approach allows us to use a mixture of different operators, thus allowing us to 
emulate a wide range of SLS algorithms investigated previously. Here are a few examples from the literature. 

Example 43 For computing SAT, Selman et al. investigated, in the mixed random walk strategy, greedy 
search ( GSAT ) combined with a random walk (RW ) [78]: A g = {(GSAT, pgsat), (RW, Prw)}- 

Example 44 For computing MPE, Kask and Dechter investigated greedy search ( GR) with stochastic sim- 
ulation (SS) using mini-bucket (MB) initialization [fl[: A g = {(GR, pgr), (SS, pss)} and A / = {(MB, 
!)}■ 

Example 45 For computing MAP, Park and Darwiche investigated (among a total of 11 SLS algorithms) 
hill-climbing (Hill) with random noise (NU), using MPE-based initialization (MPE) [69]: A g = {(Hill, 
Phill), (NU, Pnu)} and A/ = {(MPE, 1)}. 

For further examples, Schuurmans and Southey investigated eight flip selection strategies [76]; these could 
also be formulated as OperatorSGS search portfolios. 8 

Turning to our additive approach, we now introduce a particular search portfolio which contains additive 
operators; see also Table 1. 

Definition 46 Consider the compound operators from Definition 39 and suppose that pan + Pag + Pmn + 
Pmg = 1- We define A s (pan, Pag, Pmn) = {(AN, pan), (AG, Pag), (MN, pmn), (MG, pmg)}, where p M G 
= 1 — pan ~ Pag ~ Pmn , as a search portfolio. 

In Section 6.2 we will see that this search portfolio is very effective on application BNs. Also, the search 
portfolio in Definition 46 enables us to discuss an analogue between the probabilistic use of our additive 
measure and the use of noise in SLS. Varying the noise probability pat, or the probability of applying a 
non-greedy heuristic, has been found to have a dramatic impact on SLS run time [20,31,34,54,58,77,78]. 
Utilizing our additive utility measure, we introduce the concept of additive probability pa, or the probability 
of applying an additive heuristic, which is similar but orthogonal to noise probability pjv- In Definition 46, 
only the operators AN and AG are additive, hence pa — Pan +Pag- Varying the additive probability pa 
turns out to have a major impact on SLS computation time when searching for MPEs in partly deterministic 
BNs, as presented in Section 6. 

8 Schuurmans and Southey introduced a novel flip selection strategy called DLM; in addition they considered the existing 
strategies GSAT, HSAT, WSAT-G, WSAT-B, WSAT, Novelty, and Novelty-)-. 
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OperatorSGS}/?, e 

Input: 


Output: 

begin 

~ * 

X 

for 


U, MAX-FLIPS, MAX-TRIES, U min , A/, A<?) 

Bayesian network 
evidence 

measure of utility: Ua or Um 
upper bound on the number of flips 
upper bound on the number of tries 
lower bound for termination 
initialization portfolio — initialization operators 
search portfolio — greedy and noisy search operators 
is estimate of MPE 


P 

e 

U 

MAX-FLIPS 
MAX-TRIES 

L'min 

A/ 

As 

(b, x*) b £ {true, false}; x 


Initialize (/?, e, A/) {use initialization operator to create explanation x*} 

— 1 to MAX-TRIES {outer loop of tries} 
x <— Initialize}/?, e, A/) {use initialization operator to create explanation x} 
for j <— 1 to MAX-FLIPS {inner loop of flips} 

x <— Search!/?, e, x, As) {apply search operator to update x\ 
if (U{x) >U(x*)) then x* <— x 
if (U(x*) >f/rnin) then return (true, x*) 
endfor {inner loop of flips} 
endfor {outer loop of tries} 
return (false, x*) 


end 


Figure 5: The operator-based stochastic greedy search algorithm (OperatorSGS). OperatorSGS com- 
putes an MPE estimate x* given the input parameters, and operates in two main phases: an initialization 
phase and a local search phase. The initialization algorithm Initialize applies initialization operators from 
A/, while the search algorithm Search applies search operators from A g. OperatorSGS terminates if an 
explanation x* of utility U m ; n or greater is found or if the computation exceeds MAX-TRIES tries. 


CompoundSearch}/?, e, A U, C, x) 


Input: 

P 

Bayesian network 


e 

evidence 


A U 

measure of gain: AUa or A Um 


C 

criterion of choice: Cg or Cn 


X 

explanation 

Output: 

X 

explanation 


begin 

if (A17 = A Um) then 

A <— ComputeMultiplicativeGain(/ 3, e, x) 
else {the case A U = A Ua} 

A 4 — ComputeAdditiveGain(/ 3, e, x) 

endif 

(. X{,X{ ) 4 — ChooseState(A, C) {pick node V and state v from candidate array A} 
x 4 - x[Xi 4 - Xi\ 

return x 

end 

Figure 6: An algorithm used by the Search algorithm in OperatorSGS. It performs stochastic local search 
on the input explanation x , using a compound operator defined by the input parameters A U and C . The 
result is an updated explanation x, which is output. 
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5 Analysis of Portfolios in Stochastic Local Search 

The goal of this section is to analyze SLS portfolio effects and the initialization and search portfolios of 
OperatorSGS in particular. Given our observation that this operator-based approach can simulate many 
other SLS algorithms, including SimpleSGS, the analysis is hopefully of broad interest and complements 
previous SLS analysis efforts [32,58,63]. 

Section 5.1 discusses the incompleteness of SGS, and argues for the Las Vegas analysis and experimenta- 
tion approach we have used in this article. We discuss our Markov chain-based analysis approach in Section 
5.2, present results for the initialization portfolio in Section 5.3, and for the search portfolio in Section 5.4. 

5.1 Incompleteness, Termination, and Las Vegas Algorithms 

Stochastic local search algorithms are in general incomplete algorithms. Specifically, if there is no input 
parameter like SGS’s U m - ln = Pr(ai*), one cannot say whether a current estimate x* £ X* or x* X*, and 
this raises the question of when to terminate search. Incomplete algorithms are appropriate when complete 
methods are impractical due to excessive demands on time, space, or both. For example, with limited space 
a complete algorithm like clique tree clustering might not even be able to compile a given BN [55,62,82]. 
Alternatively, there might only be a fixed amount of time available that is too short for complete computation, 
and one might at the same time be willing to sometimes make a mistake. For example, there are situations 
where the cost of making an occasional mistake by finding x* ^ X*, is small compared to the cost — in 
terms of space or time — needed to use a complete BN inference algorithm such as clique tree clustering. 

Since SLS algorithms are randomized, their operation may be described using bivariate distributions with 
random variables for (i) the running time (in number of operations, say) and (ii) the current estimate of the 
utility of an optimal solution. Unfortunately, keeping track of both execution time and the quality of the 
MPE estimate complicates scientific analysis and experimentation. For simplicity, we generally take a Las 
Vegas approach [25] and use U m i„ = Pr(a:*), making SGS terminate only after an MPE is found. Clearly, in 
applications one will in general not know Pr(cc*), and therefore cannot use t/ m j„ = Pr(a:*) as a termination 
criterion in SGS. Using £/ m i n = Pr(a:*) might therefore by some readers be regarded as “cheating”. However, 
in a scientific study such as the present we argue that using U m ; n = Pr(a?*) is entirely appropriate and indeed 
often to be preferred. First, it is the hardest test of SLS algorithms in terms of computing an explanation, 
since they are not allowed to terminate with x* ^ X*. Second, since the SLS algorithm can terminate as 
soon as some x* £ X* is reached, one avoids additional complication induced by the termination criterion. 
In this manner, the effect of the initialization and search algorithms (our main interest here) is cleanly 
isolated from the effect of the termination criterion, which in applications might be quite involved. 

The question of when to terminate SLS has, we believe, ultimately an application-dependent answer. 
Certain applications, including real-time applications, have hard upper bounds on execution time and they 
might need to terminate MPE computation if that hard upper bound is exceeded. Other applications do 
not have hard real-time requirements and softer termination criteria would be appropriate. Giving specific 
recommendations beyond these general observations is unfortunately beyond the scope of this article. 

5.2 Markov Chains and Hitting Times 

We now introduce definitions of a few random variables that will be used to characterize SLS performance. 

Definition 47 Let the flip length (number of SLS flips until termination) be a discrete random variable F. 
Let the run length (number of SLS operations, both initializations and flips, until termination) be a discrete 
random variable R. 

There are in fact families of random variables involved here. These families are characterized by the rj 
input parameters of an SLS algorithm, a tuple (0i = 0\, . . ., 0 V = 0 V ). In OperatorSGS, 77 = 7 and of 
particular interest are the initialization portfolio parameter 0 ^_i = ©6 = A/ as well as the search portfolio 
parameter Q v = ©7 = A 5 . In SimpleSGS we have 77 = 6 . Since F depends on the parameter values 6 = 
(0i, . . ., Orf) used, we say F(0) when we want to make this explicit. For example, if we consider MAX-FLIPS 
and vary m in MAX-FLIPS = m while keeping the remaining 77 — 1 input parameters constant, we may write 
F{m)\ R(m ) is similar. In addition, we may condition on portfolio parameters A / and A g of OperatorSGS; 
more about this below. 

The search processes of SimpleSGS and similar SLS algorithms can be analyzed using exact or approx- 
imate Markov chain models as follows (see also [32,58]). Let us for simplicity assume n binary non-evidence 
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BN nodes such that an exact Markov chain model has 2" states. This Markov chain is structured as a 
hypercube where each hypercube node b represents a bitstring. A state b £ {0, 1}” in such a Markov chain 
has n neighbors, namely states that are one flip away. 

Definition 48 (Neighborhood) Let b be a bitstring of length n. b’s neighbors n(b), strict neighbors n'(b), 
and non-neighbors n(b) are defined as follows 

n(b) = jcG {0,1}” 

n'(b) = n(b) — {b} 

n(b) = {0, 1}” — n(b). 

We now introduce the notion of exact Markov chains over bitstrings; they will be useful when analyzing 
how SLS algorithms process problem instances. 

Definition 49 (Exact Markov chain model) An exact Markov chain model A4 has states S = {so, si, 
...} = {b | b £ {0,1}”} and an initial probability vector V with Pr(Ao = Sk) = itk for 1 < k < 2”. The 
transition probability matrix V is a stochastic matrix given by 

Pr(A t+ i = c | A t = b) = 0 if c £ n(b) (27) 

Pr(A t+1 = c | A t = b) > 0 if c £ n{b). (28) 

For the uniform noise operator tuple (NU, pnu)> (28) is clearly Pr(A t+ i = c | A t = b) = 1/n in the 
transition matrix V of an exact Markov chain model. Probabilities as introduced in Definition 34 and 
Definition 35 also define rows in transition matrixes V. 

To enable introduction of operators, we slightly extend the exact Markov chain notation in Definition 
49 as follows. If there are £ different initialization operators, we have multiple initial probability vectors V; 

with Pr(Aj 0 = Sk) for 1 < i < £ and 1 < k < 2”. If there are \ search operators, we have multiple transition 

probability matrixes Vj for 1 < j < x- 

SimpleSGS and OperatorSGS can be analyzed using exact Markov chain models up to MAX-FLIPS 
flips, similar to the SimpleSLS algorithm [58]. In fact, using the exact Markov chain model, we can provide 
a sufficient condition for MPE computation using SimpleSGS. 

Theorem 50 Let MAX-TRIES = oo, MAX-FLIPS — oo, pn > 0, U = Um, and t/ m i n = Pr(a:*) in 
SimpleSGS. Then SimpleSGS is a Las Vegas algorithm and returns (true, x*) such that x* £ X* . 

Proof. Since MAX-TRIES = oo, there are two if-then statements from which SimpleSGS can return, 
and they both read “if (U( x*) ^ Umin) then return (true, x*).” By substituting in assumptions, we obtain 

U(x*) = U M (x*) = Pr(a*) ^ t/ min = Pr(x*), 

which clearly only holds when Pr(ai*) = Pr(a?*). If SimpleSGS returns from its first if-then statement 
we are done. Suppose that SimpleSGS does not return from the first if-then statement. We argue that 
SimpleSGS will eventually return from the second if-then statement as follows. Consider the exact Markov 
chain model M = ( S , V, V) corresponding to the particular input parameters for SimpleSGS. Pick any two 
distinct states s^ £ S and Sj £ S. Since by assumption pjy > 0, it is easy to see that s,; and Sj communicate. 
Thus, all states S , including optimal states O , are recurrent. Some optimal state x* £ O , where U m i n 
= Pr(a;*), will thus eventually be visited regardless of the initial state. When this happens, SimpleSGS will 
return (true, a;*). ■ 

The proof of the following theorem, which provides a sufficient condition for MPE computation using 
OperatorSGS, is similar to that of Theorem 50. 

Theorem 51 Let MAX-TRIES = oo, MAX-FLIPS = oo, (NU, Pnu) € A 5 with p^u > 0 , U = Um, and 
Umin = Pr(x*) in OperatorSGS. Then OperatorSGS is a Las Vegas algorithm and returns (true, x*) 
such that x* £ X* . 

Due to the exponential growth of V and V as a function of n in Definition 49, it is also of interest to 
formalize SLS behavior by means of approximate Markov chains. Let the current state of SGS search be 
b, and consider how search brings us closer to or farther away from an MPE b* . In the following definition 
we assume for simplicity, but without loss of generality [58], that b* = 1 ... 1; u( b) is a function that counts 
the number of Is, or in other words the correct states relative to b * , in b. Clearly, when u(b) = n we have 
b = b*. 
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Basic Random Walk: 




Augmented Random Walk: 


States for Search Operator 1 



State for 



State for Initialization State for Initialization 
Operator 2 Operator 3 


Figure 7: An example of a basic random walk (top) and a corresponding augmented random walk model 
(bottom) for OperatorSGS and other stochastic local search algorithms that use initialization and search 
portfolios. To reduce clutter, the states for the underlying search space are shown twice, namely in the 
context of the search portolio with % = 2 operators (bottom left) and in the context of the initialization 
portfolio with £ = 3 operators (bottom right). See also Example 53. 


Definition 52 (Positive and negative neighbors) Let the bitstring b, of length n, represent the current 
state of search and let b* = 1 ... 1 be the optimal bitstring. We then define positive neighbors n + (b ) = {e E 
n{b) | u(c) > u(b)j and negative neighbors n~(b) = {c E n(b) \ u(c ) < u(b)}. 

In an approximate Markov chain model, denoted a basic random walk in Figure 7, we introduce states 
that correspond to all counts of the number of correct bits u(b'). Clearly, 0 < u(b) < n. With the exception 
of the two boundary states 0 and n, each state i has two neighbors i — 1 and i + 1. Therefore we have a 
Markov chain model with n + 1 states in which each state has two neighbors; a so-called random walk. 

Can Markov chains such as random walks be applied also in the portfolio setting, and specifically to 
analyze OperatorSGS? As we will see in the following, the answer to this question is “yes”; we first 
illustrate our analysis by means of an example. 

Example 53 Suppose that we have a BN (j with n = 5 binary nodes. For OperatorSGS, let A j = {(</> 1 , 
Pi), (<f>2, P2), (03, P3)} and A 5 = {(w 1 , < 71 ), (u> 2, 92)}- Figure 7 shows an augmented Markov chain model 
M. ( see also Definition 54) for OperatorSGS processing fd, given these portfolios A/ and A 5. 

In addition to states representing the underlying search space, corresponding to the states in a basic 
random walk model, our novel augmented Markov chain model includes states that represent the search and 
initialization operators — see Figure 7. The advantage of this model is that it allows us to analyze search 
and initialization operators within the same framework as the underlying search space. 

The details for Example 53 are as follows. We first consider states S. The number of states in S is 
k = (n + l)(x + 1) + £ = 6x3 + 3 = 21. Among these, states {0, . . . , 5} represent OperatorSGS search 
right before search operator selection. States {6, . . . , 11} represent search right after selection of search 
operator oq; states {12, . . . , 17} represent search right after selection of search operator uq- States 18, 19, 
and 20 represent selection of initialization operators <f> 1 , </> 2 , and <f> 3 respectively. In V, Pr(J3o = i) = 0 for 
0 < i < (n + l)(x + 1) — 1 = 17 and Pr(i?o = 17 + j) = pj for 1 < j < 3, where pj are the initialization 
operator selection probabilities in Example 53. In V , we have = i \ B t = j) > 0 as indicated by 

edges in Figure 7. For all other entries in V , Pr(S t+1 = i \ B t = j) = 0. 

We now formally introduce our novel Markov chain construction that clearly reflects the initialization and 
search portfolios of OperatorSGS. This construction is illustrated in Example 53. We will in Theorem 
55 formally show that a Markov chain is created. 
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Basic Random Walk: 
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Figure 8: General constructions for a basic random walk model (top) and a corresponding augmented random 
walk model (bottom) for stochastic local search algorithms using initialization and search portfolios. 


Definition 54 (Augmented random walk) Consider a BN with n binary nodes. Suppose that we have 
an initialization portfolio Aj = {(4>i, Pi) | 1 < i < £} with initial probability vectors Vi for 1 < i < £ and a 
search portfolio A $ = {(cjj, qf) \ 1 < j < x } with transition probability matrixes Vj for 1 < j < %. The 
augmented random walk model ( S , V, V) is defined as follows. Put m = (n + 1) and £ = m(\ + 1). A4 has 
k = i + £ = |S| discrete states named {0, . . . , k — 1}. The k-dimensional vector V of initial probabilities is 
defined as Pr(.B 0 = *) = 0 for 0 < i < £ — 1 and Pr(£? 0 = i) = Pi-e + 1 > 0 for l < i < k — 1. The potentially 
non-zero entries in the k x k matrix V of conditional probabilities are defined as: 


Px(B t+1 =i\B t =j + £-l) = 

Pr(A Ji0 = b), 0 < i <n and 1 < j < £ 

(29) 


{b|u(b)=i} 


Pr(B t+1 =jm + i\B t = i) = 

q j, 0 < i < n and 1 < j < x 

(30) 

Pr(B t+ i =i + l\B t =jm + i) = 

J2 Pr(A jlt+1 = b + | Aj jt = b) 

(31) 


{b+| b+£n+(b)} 


Pr(B t+ i = i - 1 \ B t = jm + i) = 

Pr(^-,t+i = b Aj tt = b), 

(32) 


{b \b Gn _ (6)} 


where in (31) and (32) we have 0 <i <n and 1 < j < X-The remaining entries in V are zero. 

We now discuss Definition 54 in more detail; see also Figure 8. There is a one-to-one mapping between the 
£ last states in the augmented random walk model and the tuples in the initialization portfolio A j = {{<fi,Pi), 
. . . , (^,Pj)}- These £ initialization (operator) states are shown at the bottom in Figure 8; the probabilities of 
the edges from these states are given by (29). There is also a relationship between states 0 < i < (n+l)(x+l) 
and the tuples in the search portfolio As = {(u;i,gi), ..., (ui x ,q x )}: States 0 < i < n, which we will call 
(search operator) selection states , correspond to the moment, in OperatorSGS, right before search operator 
selection from As according to (30). States (n+1) < i < (n+l)(x+l) are called (search operator) application 
states. Each of these latter states correspond to a time when a search operator has been picked from As, and 
is about to be applied by OperatorSGS. Once an operator has been selected and applied using (31) and 
(32), the Markov chain is again in a selection state 0 < i < n, and the selection-application cycle continues 
until restart or termination. 

There are two options for Markov chain analysis of our portfolio approach. First, we can perform analysis 
directly on the augmented random walk. Second, we can use the augmented random walk merely as an 
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aid in constructing the basic random walk, and perform analysis on the basic random walk. In either case, 
since we have a Markov chain, Markov chain analysis techniques can be applied. In the rest of this work our 
main emphasis is on analysis directly on the augmented random walk, as reflected in the following result. 

Theorem 55 An augmented random walk model ( S , V, P) induces a Markov chain M. = (5, V, P) over 
the random variables (B t , t> 0). 

Proof. We consider S, V, and V in turn. S has k = (n + l)(y + 1) + £ = £ + £ states and obviously 
defines a finite state space. V is given by (7To, . . . , 7r^+{_i), where 

e+z-i e - 1 e+z-i j 

Y n i = Y 7r < + Y m = '52Pi = 1 

i = 0 i—0 i—i j — 1 

by the definition of A/ = pi), . . . , ((/>£, pj)}. For P’s £ initialization states we have for any 1 < j < £: 
e+£-i n 

Y Pr(B t+1 =i\B t =j+£-l) = Y / P*(B t+1 =i\B t =j + £-l)= Y Pr(A ii0 = b) = 1, 

2—0 2—0 { b\u(b)=i } 

which follows from (29) and the fact that A J: q is an initial probability vector in an exact Markov chain model. 
For P’s n + 1 (search operator) selection states we have for any 0 < i < n: 

2+1-1 x x 

Y Pr (- B t+ 1 = 3 \ B t=i) =Y Pr ( s *+1 = + i I = i) = Y Qj = 

j -o i l ./ l 

which follows from (30) and the definition Ag = {(uq, qi), . . . , (w x , q x )}. Finally, for P’s y(?r + 1) (search 
operator) application states we have for 1 < j < x an< i 0 < i < n: 

2+?-l n 

Y Pl i B t + 1 = k I B t = jm + i) = Y Pr ( B *+ 1 = k \ B t = jm + i) 

k = 0 fc=0 

= Pr(P t+ i = * + 1 | B t = jm + i) + Pr(B t+1 = i - 1 | B t = jm + i), 
and by substituting (31) and (32) into the above we obtain: 

2+|-l 

Y Fl '( B t + 1 = k I B t = jm + i) = 

fc= o 

y; Pr (Aj,t+i = b + | Aj t = b) + 'Y^ Pr(Aj t+1 = b \ Aj t = b) = 1, 

{b+|6+6n+(b)} {b~ |b“6n-(b)> 

where the last equality follows because ]P Pr(A,- 1+1 = c | Aj t = b) = 1 in an exact Markov chain model 

cGn'(6) 

and n'(6) can be partitioned into n + (b) and n~(b). Since we have now shown that (i) V is a valid initial 
probability vector of size k = |«S| and (ii) all of P’s £ + n + 1 + x(n + 1 ) — k rows sum to one and hence P 
is a valid k x k conditional probability matrix, the desired result follows. ■ 

For SLS algorithms, one is typically interested in expected run time and its minimization. In this article, 
we study the expected hitting time h(0). For our augmented random walk, the concept of expected hitting 
time is well-defined since the walk is a Markov chain. Given our results on additive utility and gain in Section 
3 we are in particular interested in h{pA), were pa is the probability of applying an additive operator in the 
search portfolio. We now turn to hitting time minimization and define 9*= arg min h(6) and in particular p* A 
= arg min h { pa)- The general problems of efficiently computing 6* and p* A are topics for ongoing research. 
In the rest of this section we provide a few analytical results; in addition there are substantial empirical 
results as reflected in the experiments of this article. 

5.3 Analysis of Initialization 

We now turn to an analysis of the OperatorSGS initialization portfolio. We consider different types of 
initialization portfolios, and introduce the following terminology. 
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Definition 56 Let A be a stochastic portfolio. If |A| = 1 then we call A a homogeneous portfolio. If | A| > 2 
then we call A a heterogeneous portfolio. 

Instead of varying the size of a portfolio, as implied by Definition 56, one can set selection probabilities 
to zero. For example, a portfolio with exactly one operator is obviously equivalent to a portfolio in which 
exactly one operator has selection probability one while all other operators have selection probability zero. 
For simplicity we exclude portfolios in which operators have zero probability of selection in our analysis in 
this section. 

We introduce a random variable $ to represent OperatorSGS’s random selection of an initialization 
operator from A/. Suppose there are £ operators in the initialization portfolio. Given $ and the random 
variable F representing the number of flips, we introduce conditional probabilities Pr(U | $ = <pf) and 
conditional expectations E(F | $ = 0J for 0 < i < f. For the special case of MAX-FLIPS = oo, we have 

E(F | $ = <t>i) = E{F | B 0 = j) = E(T | B 0 = j), 

where j = i + (n + l)(x + 1) — 1. 

We may consider different initialization portfolios of the form A j = (02, P 2 )) ■ ■ ■ > (0£>Pi)}- Of 

particular interest is the optimal initialization portfolio A J, formally defined as follows. 

Definition 57 (Optimal initialization) The optimal initialization portfolio A) = (0 2 ,P2)> ••• 

, (</>£, p|)} is defined by its optimal probabilities 

P* = {Pl,‘ - ■ ,P\) = arg nrin f ^ E(F | <F = (ffjPi 

s pi , —,Pt \ ' 

\j=i 

Clearly, the challenge here is to find p* . It turns out that homogeneous initialization portfolios, defined 
as A/,! = {((/q, 1)}, A 12 = {(0 2 , 1)}- A = {(0^, 1)}, play a central role as reflected in the following 
result. 

Theorem 58 Let £ be the number of operators in the initialization portfolio. Consider, for 1 < i < 
E{F | $ = 0J. 9 Suppose that E(F | $ = 0,) ^ E{F | <& = <f)j) for all 1 < i,j < £ where i ^ j. Then there 
is a unique optimal initialization portfolio A), namely some homogeneous portfolio A jg for 1 < i < £. 

Proof. Since E(F | $ = 0J ^ E(F | <f> = <f>j) for 1 < i, j < £ where i ^ j, we assume without loss of 
generality the strict ordering of homogenous portfolios E(F | <f> = 0 X ) < E(F | $ = 0 2 ) < ••• < E(F | <F = 
0j). Consider now the expectation E(F) for an arbitrary heterogeneous initialization portfolio {(4>i,Pi), 
(0 2 , P 2 ); • • • , (0£,p£)}. Using the law of conditional expectation gives 

E(F) = jf,E(F\ $ = 0,) Pr($ = 0J = E E(F | $ = 0^. (33) 

i—1 i—1 

By our strict ordering assumption of the homogenous portfolios, we can for 2 < i < £ write E(F | <F = 0J = 
E(F | $ = 0j) + Cj for increasing constants c, > 0. For convenience, we define C\ = 0. Now, (33) can be 
written as 



E{F) = 't(E{F \ * = </> 1 ) + a)p i 
= E E(F I $ = 01 )pi + E c iPi 

i = 1 i= 1 

= E{F 1 $ = 0r) + E CiPi. (34) 

i—1 


There are now two cases to consider. Case (i): p\ = 1. From (34) it is easy to see that E(F) = E(F | ^ = 0 2 ) 
since pi = 0 for i ^ 1. Case (ii): pi < 1. From (34) it follows that E'(F) > E(F | 4> = f^), since clearly 

9 Note that E(T | $ = <j>^) is a number while E(T | <£) is a random variable. In the former case the initialization portfolio 
is given as a specific portfolio while in the latter case is not specified. 
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I 

Y CiPi > 0. Consequently, E(F | $ = <pf) is the minimal expectation, and the homogeneous initialization 
i — 1 

portfolio containing exactly one operator is optimal, and specifically A) = A/.i = 1)}. ■ 

The implication of Theorem 58 is that once parameters 0 have been fixed, then best performance is 
ensured by using a homogeneous initialization portfolio. Some readers might regard Theorem 58 as a negative 
result for initialization portfolios and might ask why we consider them in the first place. Here is why. First, 
Theorem 58 could not have been derived unless an initialization portfolio was assumed. Second, one 
is typically interested in performing inference over distributions of BNs, not one particular BN. And for 
different BNs, different initialization algorithms often turn out to be optimal, and the use of an initialization 
portfolio supports the desired flexibility in the use of initialization algorithms. Third, even though one 
initialization algorithm among many is optimal, it is often not a priori obvious which one it is. So there 
is a pre-processing phase, during which different initialization operators are used, and the existence of an 
initialization portfolio enables this pre-processing phase. 


5.4 Analysis of Search 

We now turn our attention to SLS operations and assume that MAX-TRIES = oo and that the initialization 
portfolio has been fixed. In OperatorSGS an operation is executed when an operator is picked from either 
A 5 or A / and then run. In SimpleSGS an operation is executed when Initialize is called in the loop of 
tries and when SEARCH is called in the loop of flips. 

The expected number of operations executed, including both initializations and local search steps, is 
characterized by the following result (see [63] for the derivation). 


Theorem 59 (Expected number of operations) Suppose that Pr(i?(m) < m) > 0 and let MAX-FLIPS 
= m. The expected number of SLS operations executed, E(R(m)), is given by 


E(R(m)) 


m{ 1 — Pr(F(oo) < m)) + Y'iLo * Pr(F(oo) = i) + 1 
Pr(F(oo) < m) 


(35) 


Theorem 59 is a generalization of previous results [70, 76] in that it accounts for all operators including 
initialization operators in Ag. 

Our analysis in the rest of this section assumes an SLS model (At, O), where Ad = ( S , V, V) is a Markov 
chain. Further, T is (as before) a random variable representing Ad’s hitting time. The benefit of a Markov 
chain analysis is illustrated as follows. Suppose that we vary only the A g part in 6, and further suppose that 
A s contains only two operators, namely a noisy operator (applied with probability pn ) and greedy operator 
(applied with probability pc = 1 ~Pn )■ In related work we have shown that the expected hitting time h as 
a function of pn, h{p jy), are rational functions, or in other words ratios of polynomials P{pn ) and Q{pn)'- 
h{pN) = P(pn)/Q(pn) [58]. The curves for h(piy) are analytical counterparts to and explain theoretically 
the so-called noise response curves that have been extensively studied empirically [58]. The impact of noise 
varies with BN hardness, with optimal noise level increasing with increasing problem hardness. 

Analogously, we consider here hitting time as a function of additivity pa , with corresponding (empirical) 
additive response curves investigated in Section 6. Consider an SLS model (Ad, O), where the Markov chain 
Ad = ( S,VfP ) is defined over a bitstring of length n and with parameter Pa- Let k = n- 1-1, assume optimum 
states {sa, • • • , s K } = O where A < k, and form a system of equations for Ad’s expected first passage times 
rrii for 1 < i < k. There exists an equivalent upper triangular system Um = b, where m = (mi, . . . ,m a-i) t , 
in which all coefficients in U and b are rational functions of pa- Performing back substitution on Um = b , 
we obtain the following expected hitting time result. 


Theorem 60 (Rationality of hitting time) Consider an SLS model (Ad, (A), where Ad = ( S,V,V ) is an 
augmented Markov chain defined over a bitstring of length n and with parameter pa ■ The expected hitting 
time for Ai is a rational function of pa, h{pA) = P(pa)/Q{pa), where P{pa) and Q(pa) are polynomials. 

The proof of Theorem 60 is similar to that of a previous result [58], except that (i) it concerns pa rather 
than Pm, and (ii) is based on our approximate (see Definition 54) rather than an exact Markov (see Definition 
49) chain model. In the proof of the theorem, the key idea is to perform Gaussian elimination symbolically, 
such that the parameter pA is preserved throughout the derivation. 

Theorem 60, which is concerned with a single BN, can be extended in a natural way to multiple BNs 
by means of finite mixture distributions. Again, we consider additive probability to be the independent 
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parameter. Because of closure properties [58], it follows from Theorem 60 that the SLS hitting time for a 
mixture of BNs is also a rational function H{pa) = P(pa)/Q(Pa )• To approximate the rational functions 
induced by single BNs as well as multiple BNs we have used low-order polynomials [58]; this approach is 
motivated by Weierstrass’ theorem and supported by empirical results. 


6 Experiments 

We now turn to experiments performed with the SGS system, which implements the SimpleSGS and 
OperatorSGS algorithms [55,58,61,63]. In these experiments, we attempt to carefully balance scientific 
and competitive experimentation [30]. In other words, we aim to complement our analytical results, and in 
particular improve the understanding of our portfolio approach as well as that of additive utility and gain 
(scientific experimentation [30]). For two reasons, we study SGS in depth rather than investigate a large 
number of SLS systems at a more superficial level. First, apart from SGS we know of no SLS system for 
MPE computation that implements the additive approach. Second, there is clear evidence that a problem 
instance that is hard for one SLS algorithms is also hard for other SLS algorithms [34]. 

On the competitive experimentation side, we investigate the question of how our SLS algorithms, as 
implemented by SGS, compare to exact algorithms, in particular clique tree clustering as implemented in 
the state-of-the-art tree clustering system Hugin [14,45]. Hugin uses two phases to compute an MPE, a 
compilation phase and an execution phase. Hugin was chosen since (i) we wanted to employ an exact 
algorithms to compute MPEs (as opposed to approximations) and (ii) Hugin uses one of the best exact 
methods, the tree clustering algorithm. In our experiments, we used a Las Vegas methodology, motivated in 
Section 5.1, which has also been used extensively in experimental research on satisfiability [65]: Hugin was 
always used before SGS, in order to find an MPE x* if one existed. The reason for focusing on instances 
which Hugin could process is as follows. For these cases we know for sure that some non-zero MPE x* 
exists and can therefore input U m \ n = Pr(a:*) to SGS to force convergence to an MPE. In addition, and 
following what is typically done in experiments, we assume that Hugin’s compilation time as well as SGS’s 
parameter optimization time can be amortized over a large number of MPE queries to a BN. 

In the rest of this section we provide experimental results for SGS aud compare SGS to Hugin. We use 
synthetic BNs in Section 6.1 and BNs from applications in Section 6.2. For the experiments discussed here, 
a Dell Dimension 4500 CPU, 2GHz Intel Pentium IV using 1GB of RAM and up to 2GB of swap space has 
been used. The computer was running Windows XP. 

6.1 Experiments with Synthetic Bayesian Networks 

We now compare SGS to Hugin, using synthetically generated BNs. The experiments reported in this 
section focused on varying the C/V - ratio in BNs, using synthetic networks where the hardness of MPE 
computation can be controlled. These results provide insights into the general patterns of performance of a 
stochastic local search algorithm such as SGS compared to a well-established baseline. 

Section 6.1.1 outlines the methodology used. In Section 6.1.2 we focus on the effect, on Hugin as well 
as on SimpleSGS and OperatorSGS, of varying the C/V - ratio in BNs. In Section 6.1.3 we investigate 
additive gain using OperatorSGS. 

6.1.1 Synthetic Networks: Methodology 

In this section, we briefly discuss a paradigm for the generation of increasingly hard BNs for inference, namely 
the BPART algorithm [55,62], The BPART algorithm extends research on generating hard instances for 
the satisfiability problem [65], and we are exploiting the close relationship between computing an MPE 
and finding a satisfying truth assignment of a corresponding CNF formula [8,73,81]. Generating problem 
instances at random can result in very easy problems [65]. By carefully manipulating one or more BPART 
input parameters, one can construct BNs that existing tree clustering inference algorithms cannot handle 
due to an approximately exponential increase in clique tree size [57,62]. The main quantity we vary here is 
the ratio C/V, where C and V are BPART input parameters representing the number of leaf aud and root 
nodes in a BN respectively. The C/V - ratio is perhaps easiest to motivate using bipartite BNs for medical 
diagnosis, where V is the number of diseases and C is the number of symptoms. Clearly, it is interesting 
to understand how the speed of MPE computation varies when the ratio of symptoms to diseases is varied, 
or in other words as the C/V - ratio is varied. For the experiments reported here we set BPART’s input 
parameters as follows, generating SAT-like BNs [62]: The CPT type of the root nodes was Q = uniform; the 
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BNs 

C/V 

Total clique tree size (in 1000) 

Size of maximal clique (in 1000) 

Mean 

StDev 

Median 

Max 

Min 

Mean 

StDev 

Median 

Max 

Min 

2.0 

201.7 

95.17 

172.1 

513.3 

53.34 

70.94 

44.49 

65.54 

262.1 

16.38 

2.2 

388.2 

211.1 

337.9 

1,212 

53.60 

150.6 

109.0 

131.1 

524.3 

16.38 

2.4 

573.8 

288.3 

504.5 

1,628 

131.8 

238.6 

157.1 

262.1 

1,049 

65.54 

2.6 

852.7 

449.1 

718.8 

2,302 

190.9 

329.6 

226.1 

262.1 

1,049 

65.54 

2.8 

1,285 

789.0 

1,158 

5,463 

278.5 

503.3 

356.3 

524.3 

2,097 

131.1 

3.0 

1,881 

991.7 

1,651 

6,506 

580.9 

720.9 

331.4 

524.3 

2,097 

131.1 

3.2 

2,559 

1,241 

2,337 

6,468 

607.6 

1,077 

669.5 

1,049 

4,194 

262.1 

3.4 

3,779 

1,807 

3,504 

11,204 

689.1 

1,565 

944.4 

1,049 

4,194 

262.1 


Time, all propagations (seconds) 

Time, one propagation (seconds) 

Mean 

StDev 

Median 

Max 

Min 

Mean 

StDev 

Median 

Max 

Min 

2.0 

1.119 

0.5590 

1.039 

3.234 

0.328 

0.07038 

0.03405 

0.06491 

0.1902 

0.01726 

2.2 

2.123 

1.220 

1.860 

7.047 

0.375 

0.1404 

0.08568 

0.1236 

0.5560 

0.02344 

2.4 

3.102 

1.758 

2.696 

9.953 

0.891 

0.2291 

0.1345 

0.1952 

0.7864 

0.04689 

2.6 

4.809 

3.234 

3.898 

24.55 

0.718 

0.3655 

0.2374 

0.3015 

1.444 

0.1000 

2.8 

6.349 

4.249 

5.258 

25.64 

1.234 

0.5706 

0.4164 

0.5137 

3.205 

0.1198 

3.0 

8.267 

5.039 

7.008 

36.31 

1.782 

0.8618 

0.4744 

0.7428 

3.026 

0.2206 

3.2 

12.76 

8.332 

10.79 

41.91 

2.875 

1.293 

0.6811 

1.120 

3.224 

0.2756 

3.4 

14.67 

9.867 

9.781 

66.5 

2.797 

1.864 

1.059 

1.738 

6.045 

0.3108 


Table 2: Clique tree statistics (top half) and propagation times (bottom half) for Hugin on synthetic, 
bipartite Bayesian networks. Parameters characterizing these BNs are the number of root nodes V and the 
number of leaf nodes C. 


CPT type of the non-root nodes was F = or; the number of root nodes was V = 30; the number of states per 
node was S — 2; the number of parents per leaf node was P = 3; and irregular BNs were created by setting 
R = false. Here, “irregular” refers to the fact that each root node has an irregular or randomly varying 
number of children. We varied the number of leaf nodes while keeping V — 30 constant, giving C/V - ratios 
ranging from C/V = 2.0 to C/V = 3.4. 

Our main focus in the following is the comparison of the execution phase of Hugin with that of SGS 
(using approximately optimal parameter settings) when computing an MPE. Our methodology is to generate 
a number of BN instances according to the BPART construction, run SGS and Hugin on these instances, 
and record statistics for the time it takes to compute an MPE. We do not report results for BNs that Hugin 
was not able to process, even in cases where SGS might have been able to find an MPE. For example, 
we did not experiment with as large SAT-like BNs as the SAT formulas used in earlier research [65], since 
Hugin was not able to process these large networks for interesting C/V- ratios. Consequently, our results do 
not include the region where C/V s=s 4.25, which is SAT’s phase-transition region [65]. 

Even though these SAT-like BNs have a clear mapping to SAT instances, a few crucial differences make 
our BN experiments different from previous experiments in the SAT setting. Our BN data structures are 
different, since they are able to represent arbitrary CPTs, not just the logical true or false values that arise 
in pure logical inference. Along similar lines, our gain and utility computations are generalizations of their 
logical counterparts as discussed in Section 3 and Section 4. They need to handle the general, real-valued 
case as encountered in CPTs, not just the special, integer-valued case of logical inference. 

6.1.2 Synthetic Networks: Hardness and the C/V ratio 

The purpose of the experiments reported here was to compare the performance of SGS, both SimpleSGS 
and OperatorSGS, with that of Hugin as the C/V - ratio was varied. For each C/V level, using V = 30 
and for C/V = 2.0 to C/V = 3.4, 100 BNs were generated using the BPART algorithm and processed using 
Hugin and SGS. 

Clique tree results for Hugin are summarized in Table 2; each row shows statistics for BPART instances. 
There are dramatic increases in total clique tree size as well as in the size of the largest clique as C/V increases. 
The rapid growth of the total clique tree size with C/V causes MPE computation times to grow rapidly as 
well. In Table 2, sample means and standard deviations of MPE computation times for Hugin are presented. 

We also experimented with both variants of SGS, namely SimpleSGS and OperatorSGS, with para- 
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meters set as follows: MAX-TRIES = oo, U = Ua, C = Cq, and U m [ n = 75 (for C = 60) to U m i„ = 117 
(for C = 102). MAX-FLIPS and p N were set to approximately optimal levels, varying with C/V. For both 
variants, an MPE was found for all problem instances. 

In Figure 9 the results from these experiments are shown along with exponential regression lines of the 
form y = ae bx where a, b € ffi. and x = C/V. For SimpleSGS and OperatorSGS, each data point in the 
figure represent a total of 10,000 experiments. This is because there are 100 BNs per C/V level, and each 
SGS variant was executed 100 times for each BN. The exponential regression results in Figure 9 show that 
the computation times of Hugin and SGS increase with x = C/V in a fashion that is well-approximated 
by exponential curves. For Hugin, this confirms earlier results [57,62], For Hugin, we note also that the 
execution phase consists of several propagations when there is more than one MPE [50]. The existence of 
multiple MPEs for this C/V range is reflected in the obvious difference between y = 0.03847e 1797a: (for all 
propagations) and y = 0.000870e 2 289a: (for one propagation) in Figure 9. 

Figure 9 shows that both SimpleSGS and OperatorSGS clearly outperform Hugin. Specifically, for 
MPE computation we note that SimpleSGS and OperatorSGS on average are more than two orders of 
magnitude faster than Hugin (the “all props” regression line in Figure 9) on these BPART BNs where 
2.0 < C/V < 3.4. Even if Hugin had always only needed one propagation (a best-case scenario reflected by 
the “one prop” regression line in Figure 9) in order to compute an MPE, which clearly is the lower bound, 
both SGS variants are on average well over one order of magnitude faster than Hugin for 2.0 < C/V < 3.4. 
The run time curves of SimpleSGS and OperatorSGS in Figure 9 are quite similar and in fact it is difficult 
to tell them apart. A potential disadvantage of OperatorSGS compared to SimpleSGS, namely slower 
speed due to computational overhead associated with operator selection and application, is in other words 
minimized in our current implementation. 

6.1.3 Synthetic Networks: Investigation of Additive Gain for Varying C/V 

The goal of this set of experiments was to investigate different OperatorSGS variants. In particular, 
we wanted to investigate both multiplicative gain and additive gain in SAT-like BNs, generated using the 
BPART algorithm as discussed in Section 6.1.2. Since the additive measure is tailored to SAT-like BNs, one 
might expect SGS with A 5 = {(AG, pag), (NU, Pnu)} to outperform SGS with Ag = {(MG, pmg)> (NU, 
Pnu)} in this experiment. However, SGS with Ag = {(MG, Pmg)> (NU, Pnu)} is augmented with a “force 
state” heuristic which handles deterministic nodes, as shown in Figure 3, and the question is how helpful 
this heuristic is compared to the additive measure. In this set of experiments, we again used V = 30 root 
nodes, and considered C = 60 and C = 102. 

Results of the experiments are shown in Figure 10. In the two panels to the left in the figure, we show 
results for OperatorSGS with A 5 = {(AG, pag), (NU, Pnu)}, a portfolio with an additive operator AG. 
In the two panels to the right, we show results for OperatorSGS with A 5 = {(MG, Pmg)> (NU, Pnu)}> 
a portfolio with a multiplicative operator MG. Results for C/V = 2.0 are displayed in the top two panels 
in Figure 10, while the bottom two panels show results for C/V = 3.4. In addition, we used three different 
values for p = Pnu> and consequently for pag = 1— Pnu and pmg = Pnu respectively, and varied 
MAX-FLIPS as shown on the auaxis in Figure 10. 

Our main observations with respect to these experimental results are as follows. For both C/V = 2.0 and 
C/V = 3.4, the additive approach (to the left in Figure 10) is the winner over the multiplicative approach 
in the sense of having the parameter settings that minimizes the number of flips. In addition, the additive 
approach is more robust in the sense that there is good performance for a wider range of parameter values. 
This experiment provide an experimental rationale for the “artificial” additive approach, which is used in 
OperatorSGS with Ag = {(AG, pag)> (NU, pnu)}> compared to the multiplicative approach used in 
OperatorSGS with A s = {(MG, pmg), (NU, pnu)}- 

6.2 Experiments with Application Bayesian Networks 

This section reports on SGS experiments using BNs from applications, and highlights the system’s per- 
formance when using different initialization portfolios, different search portfolios, and different measures of 
gain. The OperatorSGS algorithm was to a large extent motivated by our early work with these BNs, 
which unlike the synthetic instances investigated in Section 6.1 have no restriction on topology, number of 
states, number of parents, or CPTs. 

In Section 6.2.1 we outline the methodology and briefly discuss the BNs used in experiments. Section 6.2.2 
empirically compares SGS and Hugin. Section 6.2.3 covers the effect of using different search operators in 
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Bayesian 

network 

Nodes 

n 

States 

avg. 

Conditional probability table (CPT) values 

0 

(0, 10— J 

(10“ J ,10--J 

(lO'ClO- 1 ) 

(io-% 1) 

1 

Muninl 

189 

5.26 

56.0% 

3.64% 

3.87% 

7.99% 

18% 

10.4% 

Munin2 

1003 

5.36 

55.53% 

4.74% 

5.08% 

9.46% 

17.42% 

7.74% 

Mildew 

35 

17.6 

93.06% 

0.01% 

0.75% 

2.07% 

3.93% 

0.15% 

Water 

32 

3.63 

51.7% 

0.04% 

1.4% 

12.7% 

31.8% 

2.35% 


Table 3: Information on Bayesian networks from applications. Parameters characterizing the BNs are the 
number of nodes n and the average number of states per node. Conditional probability table values are also 
shown. These BNs have a large number of zeros in CPTs, thus for a large number of explanations x have 
Pr (a:) = Ub(x) = 0. In these highly deterministic Bayesian networks, the additive approach turns out to 
reduce run time when used in stochastic local search. 


BN 

Hugin clique tree statistics 

Hugin execution times (sec) 

Sum 

Max 

Median 

Compile 

Execute 

Total 

Prop. 

Muninl 

384,620,599 

288,000,000 

300 

6,058.2 

4,057.8 

10,114.8 

4,057.8 

Munin2 

4,861,824 

504,000 

320 

2.089 

2.445 

4.535 

0.815 

Mildew 

9,566,232 

4,372,480 

8,600 

1.0671 

3.284 

4.352 

0.821 

Water 

3,657,180 

1,769,472 

2,187 

0.616 

0.715 

1.330 

0.715 


Table 4: Performance of Hugin on application Bayesian networks. There are clique tree statistics, and 
the results in the Total column are for computing an MPE and are in seconds. The Compile and Execute 
columns give the time taken in each of these two phases. 


OperatorSGS, while Section 6.2.4 covers the effect of varying MAX-FLIPS in OperatorSGS. 

6.2.1 Application Networks: Methodology 

The BNs investigated here, most of which are taken from Friedman’s Bayesian Network Repository, are 
denoted Muninl, Munin2, Mildew, and Water. (At the time of this writing, the location of the Bayesian 
Network Repository is http://www.cs.huji.ac.il/labs/compbio/Repository/.) Given our emphasis on 
additive utility and gain derived from MAXSAT, we focus on BNs with a substantial number of zeros; see 
Table 3. 

The Munin BNs, Muninl and Munin2, are medical BNs from the field of electromyography. The Muninl 
BN has, among the BNs we consider here, the largest total clique tree size and the slowest compile and per- 
propagation execution time. The Water BN models the biological processes of water purification, while 
Mildew is for making dosage recommendations regarding the amount of fungicide needed to fight mildew in 
wheat. For each of these BNs, statistics for the CPT values along with the number of nodes and the average 
number of states per node are presented in Table 3. 

For these application BNs, we first ran Hugin. Following the methodology presented in Section 6.1.1, 
for each BN the resulting MPE probability Pr(ai*) was then used as utility limit U m ; n = Pr(ai*) for SGS. 
This approach makes our comparison between algorithms more informative as discussed in Section 5.1. 

6.2.2 Application Networks: SGS and Hugin 

The goal of the experiments we report on now is to compare the performance of OperatorSGS and Hugin 
using BNs from applications. Results for Hugin are shown in Table 4; the fill in weight triangulation heuristic 
was used. In Table 4, all Hugin timing results are averages of 30 experiments, except for Muninl which is 
the average of 5 experiments. 

Results for OperatorSGS are shown in Table 5. We use approximately optimal parameter settings for 
OperatorSGS. Reflecting our analytical result in Theorem 58, OperatorSGS uses homogenous initial- 
ization portfolios with the forward simulation (FS) and forward dynamic programming (FDP) initialization 
operators here. Further experiments investigating the impact of varying the search portfolio and MAX-FLIPS 
on the performance of OperatorSGS can be found in Section 6.2.3 and Section 6.2.4. 

Comparing the results for Hugin and OperatorSGS in Table 4 and Table 5 respectively, we make the 
following observations. First, OperatorSGS is highly competitive with Hugin for these BNs. Particularly 
impressive is the performance of OperatorSGS for Muninl and Water, where the computation time of 
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BN 

OperatorSGS input parameters 

Computation times | 

Initialization 

Search 

MAX-FLIPS 

PA 

Time (sec) 

Flips 

Muninl 

Pfs = 1.0 

Pmn = 0.05, pmg = 0.05, PAG = 0.45, pan = 0.45 

30 

0.9 

0.02537 

25.37 

Munin2 

Pfs = 1.0 

PAN = 0.5, PAG = 0.5 

70 

1.0 

1.385 

245.5 

Mildew 

Pfdp = 1-0 

Pmn = 0.336, pmg = 0.504, pan = 0.064, pag = 0.096 

200 

0.16 

0.2404 

193.3 

Water 

Pfs = 1.0 

PAN = 0.1, PAG = 0.9 

10 

1.0 

0.00625 

61.71 


Table 5: Performance of OperatorSGS on application Bayesian networks. The values of approximately 
optimal input parameters, including initialization and search operators, as well as results for computing an 
MPE are shown. Note how p A , or additive probability, is non-zero in all cases. 


OperatorSGS (measured in seconds) is substantially better than Hugin’s. For Muninl, one Hugin prop- 
agation takes 3,912.4 seconds while OperatorSGS computation time is 0.02537 seconds. For Water, one 
Hugin propagation takes 0.715 seconds while OperatorSGS computation time is 0.0625 seconds. Second, 
for three of the BNs in Table 5, approximately optimal OperatorSGS performance is obtained for additive 
probability pa > 0.9, thus highlighting the power of our additive approach. 

6.2.3 Application Networks: Different Search Operators 

In this set of experiments with OperatorSGS, we investigated different search operators by varying the 
search portfolio A g. The purpose of these experiments was to investigate the effect of varying additive 
probability pa and noise probability pn ■ Appropriate use of noise has been found to be a powerful way to 
improve SLS performance [20,31,34,54,58,77,78]; however we know of no previous experiments where the 
effect of varying additive probability pa is systematically investigated. To do so, we considered the following 
search portfolio. 

Definition 61 (Search portfolio) The search portfolio A s(pn, Pa) is parametrized by noise probability 
Pn and additive probability pa and is defined as 

A s{Pn,Pa ) := {(AN, paPn) , (AG, p A (l - p N )) , (MN, (1 - pa)Pn) , (MG, (1 -pjv)(l - Pa))} ■ 

By introducing A s{pni Pa)i we reduce the three-dimensional problem of varying probabilities in the 
portfolio As(pan, PAG) Pmn) from Definition 46 into a two-dimensional problem of varying probabilities pn 
and pa- Reflecting Theorem 58, a BN-specihc optimal initialization algorithm — either forward simulation 
[28] or a randomized variant of the Viterbi algorithm [63, 83] — was used in A / for each application BN. 
Forward simulation was used for Muninl, Munin2, and Water. Mildew was initialized using the forward 
variant of the randomized Viterbi algorithm. BN-specihc values for MAX-FLIPS were also used: MAX- 
FLIPS = 30 for Muninl, MAX-FLIPS = 100 for Munin2, MAX-FLIPS = 10 for Water, and MAX-FLIPS 
= 200 for Mildew. 

For each application BN, the probability of using an additive operator, p A , was varied from pa — 0 
to pa = 1 in increments of A p A = 0.1. Noise probability pw was varied from pn = 0.1 to pjv = 0.9 in 
increments of A pjy = 0.2. We call these and similar run time curves generated by varying the probability 
p A “additive response curves”, since they are similar to noise response curves generated by varying the 
probability pn [31,54]. Results, in the form of sample means and piecewise linear approximations for v{jpa), 
are reported in Figure 11. Each data point represents the mean of 1,000 experiments. 

In Figure 11, the case pn = Pg = 0.5 represents the situation where greedy and noise search operators 
are applied with equal probability. Further, at p A = 0 only the operators MN and MG are applied, while at 
Pa — 1 only AN and AG are used. Previous experimental results are at p A = 0 if they only use multiplicative 
gain and not additive gain in their search heuristics; we will shortly see the disadvantage of this traditional 
approach. 

Some of the main points found in Figure 11 are as follows. One main novel result is that varying p A 
has, in many cases, a striking impact on SLS run time. Specifically, by using p A = 0 as is employed in most 
traditional SLS approaches to BN inference, one may obtain far from optimal performance. The impact 
depends on the particular BN and also varies dramatically with noise level pjv- 

In Figure 11, the shapes of the additive response curves for Water, Muninl, and Munin2 are quite similar: 
Optimal levels of p A , P A , appear to be found at high values of p A , and for such values the impact of noise 
is relatively minor. Combining high noise probability pn and low additive probability p A has, on the other 
hand, a very negative impact on run time. Examples of this can be found for p^ = 0.7 and p A = 0.1 in 
Figure 11 for Water, Muninl, and Munin2. Robust OperatorSGS performance, with respect to variations 
in pjv, is ensured by using a high value for pa- 
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The shapes of the OperatorSGS run time curves for Mildew are somewhat different than those for 
Water, Muninl, and Munin2. Here we find, in all cases, monotonic decreases in run times for small values 
of pa and monotonic increases in run times for large values of pa- These convex empirical curves suggest 
an optimal value for p* A that is increasing with pn- A relatively high level of noise improves the run time on 
Mildew; pn = 0.5 gives best results in these experiments, with an average run time of 209.8 flips at p* A = 0.2. 

6.2.4 Application Networks: Impact of Restart 

In this section we investigate the impact, on OperatorSGS performance, of varying the restart parameter 
MAX-FLIPS along with pa and pn- We experiment with Mildew since it has the most challenging (and 
interesting) additive response curve in Figure 11. Similar to in Section 6.2.3, we employ the search portfolio 
As(pn, Pa)- In order to further investigate the region where minimal average run time occurred in Figure 11, 
we varied additive probability pa from pA =0.1 to pa = 0.3 in increments of A pa = 0.02. Noise probability 
Pn was varied from p n = 0.3 to Pn = 0.6 in increments of A pn = 0.1. 

Results, in the form of sample means and piecewise linear approximations for r (pa), are reported in Figure 
12. Each data point represents the mean of 1000 runs. In each panel of Figure 12, varying MAX-FLIPS has 
a substantial impact. Generally, MAX-FLIPS = 50 gives the fastest run time. Further, careful optimization 
of pa, Pn, and MAX-FLIPS gives significant benefit, and the approximate performance optimum is at pa = 
0.17, pn = 0.4, and MAX-FLIPS = 50. We also note that most of these curves are convex or almost convex, 
lending support to a hypothesis that the underlying analytical curves are convex rational functions. 


7 Related Work and Discussion 

The problem of computing a most probable explanation (MPE) in Bayesian networks has been addressed 
using exact algorithms [2,37,45,47,71,80,86] as well as inexact algorithms [39,41,48,55,61]. In addition, 
there is related work on SAT and MAXSAT [27,35,54,76,78,79] as well as on weighted MAXSAT [10,19, 
29,42,44,46,67,75,84], Since a comprehensive literature review is well beyond the scope of this article, we 
only discuss the previous research most closely related to our work in the rest of this section. 

7.1 Portfolio Algorithms 

Our work is related to research on hybrid or portfolio algorithms [25,26,38,85]. Huberman, Lukose, and 
Hogg investigate hard computational problems and how heuristic algorithms, often randomized, have been 
developed to solve such problems [38]. Their main emphasis is on portfolios of independent Las Vegas 
algorithms, where run times can be described using probability distributions, and consequently one can 
form expectation and variance (or standard deviation). By considering the 2-dimensional space spanned by 
expectation and standard deviation, an efficient frontier an be formed, similar to what is done in economics, 
based on varying the fractions of CPU-time allocated to different Las Vegas algorithms in a portfolio. 
Empirically, the NP-hard problem of graph coloring was studied using the Brelaz heuristic. By taking a 
portfolio approach, it was found that expected performance could be increased by 30% while also reducing 
risk (or standard deviation) [38]. While the main emphasis of Huberman, Lukose, and Hogg is on completely 
independent algorithms, they also discuss cooperating algorithms which is our main focus in SGS. 

Gomes and Selman note that performance profiles vary dramatically among different algorithms over 
different problem instances [25]. In response, they investigate portfolios of algorithms, and find that it can 
be beneficial to combine algorithms with high run time variance in portfolios. Gomes and Selman consider 
one or more Las Vegas algorithms running independently and in parallel on multiple processors. They 
investigate constraint satisfaction and mixed integer programming problems empirically, and conclude that 
optimal portfolio design strongly depends on details of the run time distribution. 

Lagoudakis and Littman formalize algorithm selection as a Markov decision process, and investigate 
a reinforcement learning approach to learning the value function [43]. Algorithms are selected from the 
portfolio in a cooperative, not independent, manner. Experimentally, they obtain promising results on 
algorithms for sorting and order statistics selection. Xu et al. integrate a learning approach with SAT 
algorithm portfolios [85]. All algorithms in the portfolio execute all problem instances in the training set in 
order to learn empirical hardness models. Learning is performed using features that characterize the problem 
instances, and learned ridge regression models are used to predict run times for individual problem instances. 
During test, a problem instance is solved by the SAT algorithm predicted to be fastest, hence there is no 
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cooperation between algorithms. Using this approach, strong results in the 2007 SAT competition have been 
reported [85]. 

7.2 Stochastic Local Search for SAT Computation 

We distinguish between local search algorithms that rely on the use of a history or memory (of search)- 
and those that do not. For example, tabu-search [24, 68] and guided local search [39, 64, 67] are both 
meta-heuristics that introduce histories into local search. Roughly speaking, guided local search modifies 
the objective function during search as a tactic to escape states that are local minima, while tabu search 
dynamically maintains a list of search space states that it does not immediately re-visit. While the use 
of history has been found to be powerful, it also significantly complicates analysis, and we here emphasize 
algorithms that do not rely on histories. 

We first consider related research on stochastic local search for solving the satisfiability problem (SAT) 
[27,35,54,76,78,79]. An early contribution was the GSAT algorithm, which searches for a satisfying 
truth assignment in a CNF formula [79]. GSAT is controlled by the number of local search steps before a 
restart, MAX-FLIPS, and the number of restarts, MAX-TRIES [79]. A major contribution of GSAT was its 
“sideways moves” ; the algorithm continues to flip variables even when the number of satisfied clauses stays 
the same. Experiments showed that GSAT outperformed the Davis-Putnam (DP) algorithm on synthetic 
instances [65], and strong performance on other problems including graph coloring, the N-queens problem, 
and Boolean induction were also reported [79]. 

The GSAT algorithm has been extended in a number of directions (see [35] for a summary). For instance, 
the three heuristic techniques of clause weighing, averaging in of previous near solution, and (mixed) random 
walk have all been found to improve the performance of basic GSAT [77]. Mixed random walk, which is 
also known as WALKS AT [54,78], combines random walk and greedy local search and is more focused 
than random noise in that it applies noise only in variables that occur in unsatisfied clauses. The main 
conclusion of an empirical evaluation was that mixed random walk is superior to simulated annealing and 
random noise in a wide range of cases [78]. Further flip selection strategy improvements — such as WSAT- 
G [54], WSAT-B [54], Novelty [54], Novelty-1-, and DLM [76] have followed, along with progress in noise 
adaptation [20,31,54]. 

7.3 Stochastic Local Search for MPE and MAP Computation 

We now discuss related research using stochastic search techniques to compute MPE or MAP in Bayesian 
networks [41,48,55,61,68,69,72], Pearl argued early on that stochastic simulation, also known as Gibbs 
sampling, can be used for computing the MPE although the algorithm was primarily intended for belief 
updating [72, p. 216, p. 262]. In stochastic simulation, evidence nodes are clamped, and then random 
samples are created by randomly picking among a node’s states based on their probabilities as given the 
node’s Markov blanket. 

An early investigation of MPE computation by means of stochastic search techniques considers iterative 
local search (ILS), simulated annealing (SA), and genetic search (GS) [48]. ILS is a probabilistic hill-climbing 
that combines next-ascent and random-mutation hill climbing. ILS stops on (local) maxima, however since 
it iterates, it typically finds many maxima, one of which might be an MPE. For the empirical study of these 
algorithms a bipartite BN version of the QMR medical knowledge base, QMR-DT, was used. Experiments 
with ILS, SA, and GS showed that they converged, in many cases, to an MPE. In terms of computational 
speed, ILS was found to be significantly faster than SA which was significantly faster than GS. However, 
SA was more accurate than ILS and GS. It is speculated that the weaker performance of ILS was due to 
the local search getting trapped in the plethora of local maxima [48] . 

Kask and Dechter empirically investigated MPE computation using stochastic simulation [41], and con- 
cluded that it does not perform as well as a greedy approach or as an approach where stochastic simulation 
and greedy search is combined. In addition, they found that augmenting the approach with the mini-bucket 
algorithm for initialization was important, which is in line with our results. We note that both stochastic 
simulation and greedy search are operators in OperatorSGS, such that combining greedy search and sto- 
chastic simulation, as well as additional algorithms, within the SGS framework is easy. For the purpose of 
computing MAP, which generalizes MPE, Park and Darwiche investigated a total of 11 SLS algorithms and 
found that they performed very well, in particular when MPE-based initialization was used [69] . 
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7.4 Weighted MAXSAT and Bayesian Networks 

Recently, much progress has been made in developing solvers for the weighted MAXSAT problem [19, 
29,42,44,46,84]. An early algorithm was Weighted WalkS AT [42], a generalization of WalkSAT. 
It was emphasized how the weighted MAXSAT problems could encode hard and soft constraints in discrete 
optimization problems, including NP-complete problems such as network Steiner tree problems. Network 
Steiner tree problems are concerned with finding paths in graphs. Using an approximate encoding of 
Steiner tree problems, for which much of the numeric information can be represented using clause weights, 
Weighted WALKS AT was shown to produce strong empirical results [42]. The GRASP algorithm also 
solves weighted MAXSAT using local search. Specifically, GRASP uses greedy randomized adaptive search, 
computing locally optimal solutions and using path relinking techniques to improve search [19]. 

The weighted MAXSAT problem can also be solved using exact (complete) algorithms, often based 
on the DP [13] and DPLL [12] algorithms. The MaxSolver uses heuristic strategies for DPLL, giving 
strong performance in experiments with random problems instances as well as problems instances from 
applications [84]. Other weighted MAXSAT solvers, for which details are beyond the scope of this article, 
include the MaxSatz [46], MiniMaxSAT [29], and MaxDPLL [44] solvers. 

There is a close connection between weighted model counting and MPE computation [10,67,75], and 
the use of weighted MAXSAT solvers to solve the MPE problem and the use of MPE solvers to solve the 
weighted MAXSAT problem has been investigated [67,75]. Park encodes the MPE problem as a weighted 
MAXSAT problem, uses weighted model counting algorithms to compute answers to probabilistic queries, 
and finds that the incomplete guided local search (GLS) algorithm performs very well [67]. 

7.5 Other Related Work 

Other related work includes exact BN algorithms such as clique (or join) tree propagation [2,40,45,80], 
conditioning [9,71], variable elimination [15,47,86], arithmetic circuit evaluation [6,11], and AND/OR 
search [17]. The compilation paradigm, which underlies clique tree propagation and arithmetic circuit 
evaluation, is well-suited to resource-bounded and real-time settings [56,66], and has found application 
in sensor validation and diagnosis of electrical power systems in aerospace vehicles [59,60]. Recently, a 
connection between BNs and nrulti-linear functions has been made [10,11], supporting the compilation of 
BNs into arithmetic circuits [6, 10, 11]. The compilation of BNs into arithmetic circuits may rely on encoding 
of a BN into a CNF formula [10], which has been shown to take advantage of determinism [4] as well as 
other local structure in BNs [5]. Chavira and Darwiche encode a BN in the form of a weighted CNF 
theory, and investigate the effect of search versus compilation; different encodings; and local structure and 
evidence [5,7]. 

7.6 Discussion and Comparison 

We now discuss how our SGS approach is similar to and different from related research, and also how 
OperatorSGS and SimpleSGS compare. SimpleSGS was primarily inspired by previous seminal research 
on stochastic local search for SAT, in particular the GSAT and WalkSAT algorithms [78,79]. SGS is also 
related to ILS and SLS, and extends ILS by applying to a wider range of BN topologies, using the additive 
measure, and using more advanced initialization operators before local search commences. We also note that 
the random- mutation hill climbing of ILS gives an effect similar to that of SGS noise operators. SLS [41], 
which was developed independently of SGS, shares several characteristics with SGS, such as stochastic 
steps and initialization that goes beyond initialization uniformly at random. However, there are several 
important differences between SGS and SLS, including the initialization algorithms, the fact that SGS can 
use different measures of gain (including additive gain), as well as the fact that SGS has an operator-based 
variant OperatorSGS. 

OperatorSGS makes a stronger distinction between utility and gain than SimpleSGS and other SLS 
algorithms. A key point is that each compound search operator has an associated gain, which is unrelated 
to how an explanation’s overall utility is computed using U. So, for example, some search operators in A s 
can use additive gain A Ua, while the overall utility of an explanation can be computed using multiplicative 
utility Um- This provides greater flexibility than what is present in SimpleSGS and other SLS algorithms. 
Operators that use the additive measure are optimized for the special but often occurring case of deterministic 
nodes, while operators using the multiplicative measure are optimized for the general case of probabilistic 
nodes. SimpleSGS as well as other SLS algorithms can not use both measures in the same invocation of 
the algorithm, while OperatorSGS can. Since many application BNs contain both probabilistic nodes 
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and deterministic nodes, it turns out that using A Ua and A Um in combination gives, in many cases, faster 
computation of MPEs using SGS. See Section 6.2 for supporting experiments with application BNs. 

OperatorSGS can simulate SimpleSGS in the following manner: One can regard the noise and hill- 
climbing steps of SimpleSGS as two operators chosen from a stochastic search portfolio A 5 of size two, 
where one search operator is a greedy operator such as AG, the other is a stochastic operator such as NU: 
A 5 = {(NU, pn), (AG, 1 —Pat)}. And we can regard SimpleSGS initialization as an initialization operator 
chosen from an OperatorSGS stochastic initialization portfolio A/ of size one: A / = {(UN, 1)}. The 
advantage of OperatorSGS compared to standard SGS is generality and flexibility, which typically leads, 
as shown in experiments in Section 6.1 and Section 6.2, to better performance. Further, as detailed in Section 
6.1, the speed of OperatorSGS and SimpleSGS is essentially the same on SAT-like BNs. 

Previous portfolio-based approaches have emphasized running multiple heuristics independently, either 
on the same computer or independently on multiple computers [25,38]. Alternatively, they have picked, 
for a particular problem instance, the best algorithm among a portfolio of algorithms [85]. In comparison, 
there is in OperatorSGS a cooperation between heuristics (similar to [43]), in the sense that initialization 
and search operators work on the same explanation x to compute an MPE estimate x* . There is currently 
no learning in OperatorSGS, unlike in some related work [36,43,74,85]. Finally, we note that our Las 
Vegas analysis and experimental approach is different from most previous efforts in several ways. First, 
we emphasize finding an MPE, and not just getting close to it, and therefore we compare with an exact 
method, the Hugin clique tree clustering algorithm. Second, our synthetic networks are constructed in a 
systematic way, generating instances of varying hardness. This allows us to experimentally show that SGS 
can outperform Hugin on synthetic BN of increasing hardness. 

Comparing SLS algorithms with a broader range of algorithms including those for weighted MAXSAT 
[10,19,29,42,44,46,67,75,84], we find that two broad approaches are emerging: In the encoding approach , 
instances of one problem (say, MPE) can be encoded as another (say, weighted MAXSAT). A current 
view is that MPE and weighted MAXSAT are complementary, where an instance of one problem can be 
encoded as an instance of the other problem, with some loss of effectiveness [75]. In the generalization 
approach , on the other hand, one or more existing algorithms or techniques are generalized to handle a wider 
range of problems. For instance, one may generalize from MAXSAT to weighted MAXSAT (see [42,44]) or 
from MAXSAT to MPE (as in [58] and in this article). The two approaches, encoding and generalization, 
both have their pros and cons. In the encoding approach, there is a separate encoding or compilation 
step that can be performed off-line. This step can thus be amortized over a potentially large number of 
queries to the same problem instance. A disadvantage is that encoding may lead to a blow-up in problem 
instance size. Understandability and simplicity are, on the other hand, facilitated by the generalization 
approach; disadvantages include the lack of an off-line step and ability to take advantage of special problem 
structure. It appears that both approaches merit investigation, and in this paper we generally are taking a 
generalization approach. 


8 Conclusion and Future Work 

Stochastic local search has proven to be powerful algorithms for finding satisfying assignments in satisfiability 
(SAT) instances [27,35,54,76,78,79]. In this article, we discuss the generalization of stochastic local search 
from the logical case of SAT to the probabilistic case of Bayesian networks (BNs). While SLS techniques for 
MPE and MAP computation are well established [41,48,61,68,69], we present here several novel techniques 
and results centering around search and initialization portfolios as well as an additive measure of utility 
and gain. These techniques have been incorporated into the SGS stochastic local search approach, which 
lreuristically computes a most probable explanation (MPE) in a Bayesian network. SGS combines greedy 
search, stochastic search, and stochastic initialization algorithms. We have presented two SGS algorithms, 
namely simple SGS (SimpleSGS) and operator-based SGS (OperatorSGS); the latter is more flexible and 
general than the former due to its initialization and search portfolios. The initialization and search portfolios 
of OperatorSGS contain operators, which are algorithms for initialization or search respectively, along 
with probabilities that control their selection and execution. The OperatorSGS algorithm, for which we 
provide both theoretical and experimental results, is closely related to previous research on SLS algorithms 
for MPE and MAP computation [41,68,69] as well as to work on portfolio (or hybrid) algorithms for other 
computational problems [25,26,38,39,61,85]. 

We carefully formalized the concepts of SLS utility and gain in the context of MPE computation, and 
emphasized our additive approach. The additive measure, for which we provided several new results, is a 
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generalization of GSAT’s and WalkSAT’s utility and gain measures to the probabilistic setting. Application 
BNs often have many deterministic nodes, and the additive measure turned out to be very powerful in 
such BNs. This measure has, to our knowledge, not been extensively studied in previous research on MPE 
computation using stochastic local search, and we also investigated how it relates to traditional multiplicative 
gain as used in other SLS algorithms for MPE. In addition, we introduced a novel Markov chain model, 
augmented random walk, where the behavior of OperatorSGS’s initialization and search operators is 
explicitly represented as Markov chain states. We also showed that the optimal OperatorSGS initialization 
portfolio is homogenous. 

Two sets of experiments have been conducted. We have shown that SGS outperforms Hugin on hard BNs 
constructed from satisfiability instances, and also outperforms Hugin on application BNs with a high degree 
of determinism. In one set of experiments, we have utilized an approach to constructing synthetic Bayesian 
networks of varying hardness [62]. This approach helps in developing an understanding of the suitability of 
different algorithms for different classes of BNs. In these experiments, we found that SGS can outperform 
Hugin by well over one order of magnitude, and in particular that SGS consistently performed one or more 
orders of magnitude better than the state-of-the-art exact algorithm Hugin as the C/V-ratio was varied. In 
addition, we found that the additive measure Ua gave better performance than the multiplicative measure 
Um in partly deterministic, SAT-like networks. 

Experiments with OperatorSGS on application BNs have also been done. We have found that the 
algorithm is quite effective on these networks too, and performs comparably to Hugin. Key factors in 
the success of OperatorSGS are the initialization operators, the additive measure, and the approach of 
using a portfolio of operators for computation. OperatorSGS’s varying approximately optimal parameter 
values, including values for the selection probabilities of the initialization and search operators, highlights 
the importance of the portfolio approach to stochastic local search. 

Areas for current and future work include the following. First, the important role of portfolios and 
different measures of gain, especially in application networks, highlights the opportunity of adaptively tuning 
the probabilities (similar to noise adaptation [20,31,54]) of portfolio operators when computing MPE or MAP 
for a given BN. While our focus here has not been on adaptation or learning during search, we believe that our 
framework and results can also enable innovations in these areas, thereby further enhancing portfolio-based 
stochastic local search algorithms. Both analytical and experimental work on portfolio adaptation would be 
of great interest. Second, in local search, an essential question is how to escape from local minima. In this 
article, our answer has been the use of noise, which is a local escape mechanism. Another approach is to use 
crossover from genetic algorithms. In the MPE or MAP context, crossover would take place between two 
explanations, thus providing a more global escape mechanism than noise. More generally, there is potential 
for additional hybridization, for example combining stochastic local search with clique tree clustering. Third, 
additional investigation of when to terminate, other than by using a Las Vegas approach or time limits, is 
needed in order to increase the easy-of-use of the SGS approach in applications. 
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Figure 9: A comparison of the performance of Hugin, SimpleSGS, and OperatorSGS as C/V is varied. 
In general, Hugin needs multiple clique tree propagations in order to find an MPE, hence results for one 
propagation (“Hugin - one prop”) as well as multiple propagations (“Hugin - all props”) are shown. On 
average, both SimpleSGS and OperatorSGS are more than one order of magnitude faster than one Hugin 
propagation, and more than two orders of magnitude faster than all Hugin propagations. 


Additive, C/V =2.0 



Multiplicative, C/V =2.0 




Figure 10: Empirical results for synthetic BNs under different experimental conditions for OperatorSGS. 
The additive search operator AG is used in the column to the left, the multiplicative search operator MG 
is used in the column to the right. The top row is for C/V = 2.0 and the bottom row is for C/V — 3.4. The 
restart parameter varies from MAX-FLIPS = 8 to MAX-FLIPS = 4096 as shown on the a/-axis. The noise 
probability p = pn is varied according to the labels. Each data point represents the sample mean of 10,000 
runs: 100 runs for each of 100 BNs. 
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Figure 11: Empirical results for OperatorSGS for the BNs Muninl, Munin2, Water, and Mildew under 
different experimental conditions. The additive probability pa, or the probability of applying an additive 
search operator, varies from pa = 0 to pa = 1 as shown on the x-axis. The noise probability p = p jy is 
varied from p^ = 0.1 to pn = 0.9 according to the labels. Each data point represents the sample mean of 
1000 runs. Clearly, varying pa and pn has a significant impact on run time. 
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Mildew, noise probability p = 0.3 



Mildew, noise probability p = 0.4 



Mildew, noise probability p = 0.5 



Mildew, noise probability p = 0.6 



Figure 12: Empirical results for the Mildew BN under different experimental conditions for OperatorSGS. 
The probability pa of applying an additive search operator, varying from pa = 0.1 to pa = 0.3, is shown 
on the x-axis. The noise probability p = pn is systematically varied from p n = 0.3 (top left) to pn = 0.6 
(bottom right), and the restart parameter MAX-FLIPS is varied according to the labels. Each data point 
represents the sample mean of 1,000 runs. Clearly, varying pa, Pn, and MAX-FLIPS has a significant impact 
on run time. 
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